Learn Before
Attention Weight with Relative Positional Encoding
The attention weight \alpha_{i,j} in a causal attention mechanism can be calculated by incorporating relative positional information directly into the attention score. The formula is: \alpha_{i,j} = \mathrm{Softmax}\left(\frac{q_i \cdot k_j}{\sqrt{d}} + \mathrm{PE}(i, j) + \mathrm{Mask}(i, j)\right). Here, the score is based on the dot product of the query vector q_i and the key vector k_j, scaled by the square root of the dimension d. A relative positional encoding term, PE(i, j), is added to this score to inject information about the relative distance between positions i and j. The term Mask(i, j) is used to enforce causality by preventing attention to future positions (where j > i), typically by setting those scores to −∞ so that their normalized weights become zero.
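Below is a minimal NumPy sketch of this calculation, assuming the relative term is a scalar bias that depends only on the distance i − j; the function name causal_attention_weights and the toy linear bias are illustrative, not taken from the course material.

```python
import numpy as np

def causal_attention_weights(Q, K, rel_bias):
    """alpha[i, j] = Softmax_j( q_i · k_j / sqrt(d) + PE(i - j) + Mask(i, j) )."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                        # scaled dot-product similarity
    # relative positional term: a scalar bias that depends only on the distance i - j
    rel = np.array([[rel_bias(i - j) for j in range(n)] for i in range(n)])
    # causal mask: future positions (j > i) get -inf, so their softmax weight is 0
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = scores + rel + mask
    # row-wise softmax over the allowed (current and past) positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: 4 tokens, d = 8, and a toy bias that slightly favors nearby positions
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
alpha = causal_attention_weights(Q, K, rel_bias=lambda dist: -0.1 * dist)
print(alpha.round(3))  # each row sums to 1; entries above the diagonal are exactly 0
```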

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Attention Weight with Relative Positional Encoding
A language model is designed to generate a sentence one word at a time, from beginning to end. To generate the word at a specific position i, it uses an attention mechanism to weigh the importance of the words that came before it. Which of the following statements correctly analyzes the structural constraint required for this mechanism to function properly for this specific task?
Formula for Attention Weight with Relative Positional Encoding
Analyzing Attention Mechanism Constraints
An autoregressive model is processing the input sequence 'The quick brown fox'. When calculating the output representation for the token 'brown' (the third token), which set of tokens can it attend to if a causal attention mechanism is being used?
Learn After
Consider the calculation of an attention weight, which determines the influence of an input at position j on the output at a later position i. The calculation is based on a formula that includes: 1) a similarity score between vectors from positions i and j, 2) a term that depends on the relative distance between i and j, and 3) a masking component that prevents attending to positions k where k > i. If the term that depends on the relative distance were removed from this calculation, what would be the primary consequence?
Calculating Pre-Normalized Attention Scores
In a causal attention mechanism that incorporates relative positional information, consider the calculation of attention for an output at position i. If the dot product of the query vector from position i with the key vector from position j is identical to its dot product with the key vector from position k (where j ≠ k, and both j, k < i), then the final attention weights assigned to positions j and k will also be identical.