Formula

Autoregressive Individual Attention Head Computation

During text generation, the output of the $j$-th attention head at step $i$ is computed by applying the Query-Key-Value (QKV) attention function to its specific feature sub-space. This operation uses the current token's query vector, $\mathbf{q}_{i}^{[j]}$, along with the cached key and value matrices for all tokens up to step $i$, denoted $\mathbf{K}_{\le i}^{[j]}$ and $\mathbf{V}_{\le i}^{[j]}$. By projecting these representations onto the $j$-th sub-space, the model can be interpreted as performing attention on a group of independent feature sub-spaces in parallel. The calculation is formalized as

$$\mathrm{head}_j = \mathrm{Att}_{\mathrm{qkv}}\big(\mathbf{q}_{i}^{[j]},\, \mathbf{K}_{\le i}^{[j]},\, \mathbf{V}_{\le i}^{[j]}\big).$$
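
To make the formula concrete, here is a minimal NumPy sketch of one generation step, assuming $\mathrm{Att}_{\mathrm{qkv}}$ is standard scaled dot-product attention (the text does not pin down its exact form); the function name `att_qkv`, the head count, the head dimension, and the randomly filled caches are illustrative, not part of the source.

```python
import numpy as np

def att_qkv(q_i, K_le_i, V_le_i):
    """One attention head at generation step i.

    q_i:    (d,)   query vector of the current token, projected to head j
    K_le_i: (i, d) cached keys for all tokens up to step i (head j's sub-space)
    V_le_i: (i, d) cached values for all tokens up to step i
    Returns the head output, shape (d,).
    """
    d = q_i.shape[-1]
    scores = K_le_i @ q_i / np.sqrt(d)       # (i,) attention logits
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V_le_i                  # (d,) weighted sum of cached values

# Illustrative usage: each head attends over its own cached sub-space in parallel.
rng = np.random.default_rng(0)
n_heads, d_head, step = 8, 64, 5             # assumed sizes, for demonstration only
K_cache = [rng.standard_normal((step, d_head)) for _ in range(n_heads)]
V_cache = [rng.standard_normal((step, d_head)) for _ in range(n_heads)]
q = [rng.standard_normal(d_head) for _ in range(n_heads)]

heads = [att_qkv(q[j], K_cache[j], V_cache[j]) for j in range(n_heads)]
output = np.concatenate(heads)               # per-head outputs recombined, shape (n_heads * d_head,)
```

Because the caches $\mathbf{K}_{\le i}^{[j]}$ and $\mathbf{V}_{\le i}^{[j]}$ only grow by one row per generated token, each step reuses all previous projections rather than recomputing them, which is what makes this autoregressive formulation efficient.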
