Learn Before
Causal Attention Output for a Single Token
In autoregressive language models, next tokens are predicted based solely on their preceding context (the 'left-context'). Accordingly, the output of the attention mechanism for a single token at position $i$ is calculated using only information from tokens $0$ to $i$. This output is formulated as the product of the attention weight row vector for token $i$ and the matrix of corresponding value vectors up to that position:

$$\mathbf{o}_i = \boldsymbol{\alpha}_i \, \mathbf{V}_{\le i}$$

This matrix multiplication is equivalent to a weighted sum of the value vectors:

$$\mathbf{o}_i = \sum_{j=0}^{i} \alpha_{i,j} \, \mathbf{v}_j$$

In these equations, the keys and values up to position $i$ are respectively defined as the matrices $\mathbf{K}_{\le i} = [\mathbf{k}_0; \dots; \mathbf{k}_i]$ and $\mathbf{V}_{\le i} = [\mathbf{v}_0; \dots; \mathbf{v}_i]$, and $\boldsymbol{\alpha}_i = [\alpha_{i,0}, \dots, \alpha_{i,i}]$ is the row of attention weights for token $i$.
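The single-token formulation above can be sketched in NumPy. This is a minimal illustration, not the course's reference implementation; the function name `causal_attention_output` and the softmax over query-key scores are assumptions consistent with the scaled dot-product attention described in the related questions below.

```python
import numpy as np

def causal_attention_output(q_i, K_prefix, V_prefix):
    """Attention output for the token at position i, using only the
    keys and values up to and including position i (the left-context)."""
    d = q_i.shape[-1]
    scores = q_i @ K_prefix.T / np.sqrt(d)   # one score per visible token
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()              # softmax -> attention weight row vector
    return alpha @ V_prefix                  # weighted sum of the value vectors

# Toy example: the token at position 2 attends to positions 0..2.
rng = np.random.default_rng(0)
q = rng.standard_normal(4)        # query for token at position 2
K = rng.standard_normal((3, 4))   # keys k_0..k_2
V = rng.standard_normal((3, 2))   # values v_0..v_2
o = causal_attention_output(q, K, V)
```

The final matrix product `alpha @ V_prefix` is exactly the weighted sum of value vectors stated in the second equation.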

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Role of Causal Attention in Autoregressive Language Models
Causal Attention Output for a Single Token
Visualization of Query-Key Dot Products in Causal Attention
An autoregressive model calculates a square attention weight matrix using the formula:
Softmax((QK^T / sqrt(d)) + Mask). The purpose of the Mask component is to prevent any token from attending to subsequent tokens in the sequence. Which statement best describes the resulting attention weight matrix?

An autoregressive model is processing a sequence of 4 tokens. To ensure that the prediction for any given token is based only on the tokens that came before it and the token itself, a specific structure is imposed on the attention weight matrix. Which of the following 4x4 matrices correctly illustrates this structure, where 'α' represents a calculated, non-zero attention weight and '0' represents a weight that has been forcibly set to zero?
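The masked-softmax formula in the question above can be sketched directly; this is an illustrative NumPy version (function name assumed), showing that adding -inf above the diagonal before the softmax yields a lower-triangular weight matrix whose rows sum to 1.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Full attention weight matrix Softmax((QK^T / sqrt(d)) + Mask).
    Position i receives exactly zero weight on every position j > i."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Mask: 0 on and below the diagonal, -inf strictly above it,
    # so the softmax assigns zero probability to future tokens.
    mask = np.where(np.triu(np.ones((n, n)), k=1) == 1, -np.inf, 0.0)
    scores = scores + mask
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
A = causal_attention_weights(rng.standard_normal((4, 8)),
                             rng.standard_normal((4, 8)))
# A is 4x4, lower-triangular: entries above the diagonal are exactly 0.
```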
Applying a Causal Mask to Attention Scores
Learn After
In an autoregressive model, the attention output for a token is a weighted sum of the value vectors of itself and all preceding tokens. Consider a sequence of three tokens (at positions 0, 1, and 2). The value vectors are given as v_0 = [1, 2], v_1 = [3, 0], and v_2 = [4, 5]. The attention weights for the token at position 2, which determine the contribution of each token in the context, are α_2,0 = 0.1, α_2,1 = 0.6, and α_2,2 = 0.3. Based on this information, what is the attention output vector for the token at position 2?
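The weighted sum in the question above can be checked with a few lines of NumPy (values and weights are taken verbatim from the question):

```python
import numpy as np

V = np.array([[1.0, 2.0],    # v_0
              [3.0, 0.0],    # v_1
              [4.0, 5.0]])   # v_2
alpha_2 = np.array([0.1, 0.6, 0.3])  # weights alpha_2,0 .. alpha_2,2

# Attention output for the token at position 2:
# same as 0.1*v_0 + 0.6*v_1 + 0.3*v_2
o_2 = alpha_2 @ V
# o_2 -> array([3.1, 1.7])
```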
Interpreting Causal Attention Output
Debugging a Causal Attention Calculation
Dense Attention Assumption