Formula

Attention Weight with Relative Positional Encoding

The attention weight $\alpha(i, j)$ in a causal attention mechanism can be calculated by incorporating relative positional information directly into the attention score. The formula is:

$$
\alpha(i, j) = \text{Softmax}\!\left(\frac{\mathbf{q}_i \mathbf{k}_j^\top + \text{PE}(i, j)}{\sqrt{d}} + \text{Mask}(i, j)\right)
$$

Here, the score is based on the dot product of the query vector $\mathbf{q}_i$ and the key vector $\mathbf{k}_j$, scaled by the square root of the dimension $d$. A relative positional encoding term, $\text{PE}(i, j)$, is added to this score to inject information about the relative distance between positions $i$ and $j$. The $\text{Mask}(i, j)$ term enforces causality by preventing attention to future positions (where $j > i$).
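As a minimal NumPy sketch of this formula: the function name `causal_attention_weights` and the `rel_pe` argument are illustrative assumptions, not part of the source; it assumes `rel_pe` is a precomputed $T \times T$ matrix holding $\text{PE}(i, j)$.

```python
import numpy as np

def causal_attention_weights(Q, K, rel_pe):
    """Compute alpha(i, j) with a relative positional bias and a causal mask.

    Q      : (T, d) array of query vectors q_i
    K      : (T, d) array of key vectors k_j
    rel_pe : (T, T) array of relative positional encodings PE(i, j)  (assumed precomputed)
    """
    T, d = Q.shape
    # (q_i . k_j + PE(i, j)) / sqrt(d)
    scores = (Q @ K.T + rel_pe) / np.sqrt(d)

    # Mask(i, j) = -inf for future positions j > i, 0 otherwise
    mask = np.triu(np.full((T, T), -np.inf), k=1)
    scores = scores + mask

    # Row-wise softmax over j gives alpha(i, j); masked entries become exactly 0
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Each row $i$ of the returned matrix sums to 1, and entries with $j > i$ are zero, reflecting the causal constraint.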
