Formula

Autoregressive Individual Attention Head Computation

During text generation, the output of the $j$-th attention head at step $i$ is computed by applying the Query-Key-Value (QKV) attention function to its specific feature sub-space. This operation uses the current token's query vector, $\mathbf{q}_{i}^{[j]}$, along with the cached key and value matrices for all tokens up to step $i$, denoted $\mathbf{K}_{\le i}^{[j]}$ and $\mathbf{V}_{\le i}^{[j]}$. By projecting these representations onto the $j$-th sub-space, the model can be interpreted as performing attention on a group of independent feature sub-spaces in parallel. The calculation is formalized as

$$\mathrm{head}_j = \mathrm{Att}_{\mathrm{qkv}}\big(\mathbf{q}_{i}^{[j]},\, \mathbf{K}_{\le i}^{[j]},\, \mathbf{V}_{\le i}^{[j]}\big).$$
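
To make the formula concrete, here is a minimal NumPy sketch of one generation step, assuming $\mathrm{Att}_{\mathrm{qkv}}$ is standard scaled dot-product attention (the text does not pin down its exact form); the function name `att_qkv`, the head count, the head dimension, and the randomly filled caches are illustrative, not part of the source.

```python
import numpy as np

def att_qkv(q_i, K_le_i, V_le_i):
    """One attention head at generation step i.

    q_i:    (d,)   query vector of the current token, projected to head j
    K_le_i: (i, d) cached keys for all tokens up to step i (head j's sub-space)
    V_le_i: (i, d) cached values for all tokens up to step i
    Returns the head output, shape (d,).
    """
    d = q_i.shape[-1]
    scores = K_le_i @ q_i / np.sqrt(d)       # (i,) attention logits
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V_le_i                  # (d,) weighted sum of cached values

# Illustrative usage: each head attends over its own cached sub-space in parallel.
rng = np.random.default_rng(0)
n_heads, d_head, step = 8, 64, 5             # assumed sizes, for demonstration only
K_cache = [rng.standard_normal((step, d_head)) for _ in range(n_heads)]
V_cache = [rng.standard_normal((step, d_head)) for _ in range(n_heads)]
q = [rng.standard_normal(d_head) for _ in range(n_heads)]

heads = [att_qkv(q[j], K_cache[j], V_cache[j]) for j in range(n_heads)]
output = np.concatenate(heads)               # per-head outputs recombined, shape (n_heads * d_head,)
```

Because the caches $\mathbf{K}_{\le i}^{[j]}$ and $\mathbf{V}_{\le i}^{[j]}$ only grow by one row per generated token, each step reuses all previous projections rather than recomputing them, which is what makes this autoregressive formulation efficient.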
