Formula

Causal Attention Output for a Single Token

In autoregressive language models, the next token is predicted based solely on its preceding context (the left-context). Accordingly, the output of the attention mechanism for a single token at position $i$ is computed using only information from tokens $0$ to $i$. This output is formulated as the product of the attention weight row vector for token $i$ and the matrix of the corresponding value vectors up to that position:

$$\mathrm{Att}_{\mathrm{qkv}}(\mathbf{q}_i,\mathbf{K}_{\le i},\mathbf{V}_{\le i}) = \begin{bmatrix} \alpha_{i,0} & \dots & \alpha_{i,i} \end{bmatrix} \begin{bmatrix} \mathbf{v}_0 \\ \vdots \\ \mathbf{v}_{i} \end{bmatrix}$$

This matrix multiplication is equivalent to the weighted sum of the value vectors:

$$\mathrm{Att}_{\mathrm{qkv}}(\mathbf{q}_i,\mathbf{K}_{\le i},\mathbf{V}_{\le i}) = \sum_{j=0}^{i} \alpha_{i,j}\,\mathbf{v}_{j}$$

In these equations, the keys and values up to position $i$ are defined as the matrices $\mathbf{K}_{\le i} = \begin{bmatrix} \mathbf{k}_0 \\ \vdots \\ \mathbf{k}_{i} \end{bmatrix}$ and $\mathbf{V}_{\le i} = \begin{bmatrix} \mathbf{v}_0 \\ \vdots \\ \mathbf{v}_{i} \end{bmatrix}$, respectively.
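
The equivalence of the matrix-product form and the weighted-sum form can be checked numerically. The sketch below is a minimal NumPy illustration; it assumes the weights $\alpha_{i,j}$ are obtained from a softmax over scaled dot products $\mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d_k}$ (a standard choice, not spelled out in this section), and the function and variable names are hypothetical.

```python
import numpy as np

def causal_attention_output(q_i, K_le_i, V_le_i):
    """Attention output for the token at position i, using only the keys
    and values of positions 0..i (the left-context)."""
    d_k = K_le_i.shape[-1]
    # Scaled dot-product scores of q_i against every key up to position i.
    scores = K_le_i @ q_i / np.sqrt(d_k)            # shape: (i + 1,)
    # Softmax turns the scores into the attention weights alpha_{i,0..i}.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Row vector times value matrix ...
    out_matmul = weights @ V_le_i                   # shape: (d_v,)
    # ... equals the weighted sum over the value vectors.
    out_sum = sum(w * v for w, v in zip(weights, V_le_i))
    assert np.allclose(out_matmul, out_sum)
    return out_matmul

# Toy check: the token at position i = 2 attends to positions 0, 1, 2 only.
rng = np.random.default_rng(0)
i, d_k, d_v = 2, 4, 3
q_i = rng.normal(size=d_k)
K_le_i = rng.normal(size=(i + 1, d_k))
V_le_i = rng.normal(size=(i + 1, d_v))
print(causal_attention_output(q_i, K_le_i, V_le_i))
```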

