Attention with Prior
Attention with prior uses an attention distribution that comes from a source other than the inputs, in contrast to the generated attention (e.g., softmax(QKᵀ) in the vanilla Transformer). Attention with prior is the fusion of these two attention distributions, which can be done by computing a weighted sum of the scores corresponding to the prior and the generated attention before applying softmax.
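As a concrete illustration, here is a minimal sketch of this fusion using a simple distance-based prior that favors nearby positions. The function name, the mixing weight alpha, and the choice of prior are illustrative assumptions for this sketch, not a fixed API from any particular paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prior(Q, K, V, prior_scores, alpha=0.5):
    """Fuse a prior attention distribution with the generated one.

    Scores generated from the inputs (QKᵀ / sqrt(d)) are combined with
    prior_scores by a weighted sum *before* the softmax, as described above.
    alpha is an illustrative mixing weight, not a standard hyperparameter name.
    """
    d = Q.shape[-1]
    generated_scores = Q @ K.T / np.sqrt(d)              # standard scaled dot-product scores
    fused_scores = alpha * prior_scores + (1 - alpha) * generated_scores
    weights = softmax(fused_scores, axis=-1)             # attention distribution after fusion
    return weights @ V

# Usage example: a distance-based prior that gives higher scores to closer positions.
n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
positions = np.arange(n)
prior = -np.abs(positions[:, None] - positions[None, :]).astype(float)
out = attention_with_prior(Q, K, V, prior, alpha=0.5)
print(out.shape)  # (4, 8)
```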

Tags
Data Science
Related
Sparse Attention
Query Prototyping and Memory Compression
Low Rank Self-Attention
Attention with Prior
Improved Multi-Head Attention Mechanism
Linear Attention
A research team is working to reduce the computational cost of the attention mechanism for processing extremely long documents. Their proposed solution involves modifying the attention calculation so that each query token only computes attention scores with a small, fixed subset of key tokens (e.g., neighboring tokens and a few globally important tokens) instead of all tokens in the sequence. Which category of attention improvement best describes this approach?
Match each attention improvement strategy with its core operational principle.
Optimizing Transformer Attention for Long Sequences
Evaluating Attention Optimization Strategies for Specific Applications