Formula

Causal Attention Output for a Single Head and Token

In a causal multi-head attention mechanism, the output for a single head $j$ at token position $i$ is computed with the standard query-key-value (QKV) attention function. The computation is restricted to the current and preceding tokens to preserve the autoregressive property:

$$\text{head}_j = \text{Att}_{\text{qkv}}\big(\mathbf{q}_i^{[j]}, \mathbf{K}_{\leq i}^{[j]}, \mathbf{V}_{\leq i}^{[j]}\big)$$

Here, $\mathbf{q}_i^{[j]}$ is the query vector for the $i$-th token projected for head $j$, while $\mathbf{K}_{\leq i}^{[j]}$ and $\mathbf{V}_{\leq i}^{[j]}$ are the key and value matrices for head $j$, containing information from tokens $0$ up to $i$.
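The formula above can be sketched in NumPy. This is a minimal illustration, not the book's reference implementation: the function name `att_qkv`, the dimensions, and the random projections are assumptions for the example. Causality is enforced simply by slicing the key and value matrices to tokens $0$ through $i$ before computing attention.

```python
import numpy as np

def att_qkv(q, K, V):
    """Scaled dot-product attention for a single query vector.

    q: (d_k,)      query for token i, head j
    K: (i+1, d_k)  keys for tokens 0..i
    V: (i+1, d_v)  values for tokens 0..i
    Returns the (d_v,) output vector for head j at position i.
    """
    d_k = q.shape[-1]
    scores = K @ q / np.sqrt(d_k)            # one score per visible token
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # weighted sum of values

# Hypothetical head-projected tensors for a short sequence.
rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))

# head_j for token i: only K[:i+1] and V[:i+1] are visible (causal mask).
i = 3
head_j = att_qkv(Q[i], K[:i + 1], V[:i + 1])
```

In a full multi-head layer this computation runs once per head, and the per-head outputs are concatenated and passed through an output projection; the slicing to `:i + 1` is what the $\leq i$ subscripts in the formula express.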


Updated 2026-04-23

Tags: Ch.2 Generative Models - Foundations of Large Language Models; Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences