Learn Before
Causal Attention Output for a Single Head and Token
In a causal multi-head attention mechanism, the output for a single head $j$ at a specific token position $i$ is computed using the standard Query-Key-Value (QKV) attention function. This calculation is restricted to the current and preceding tokens to maintain the autoregressive property. The formula is: $$\mathrm{head}_j^{(i)} = \mathrm{Softmax}\!\left(\frac{\mathbf{q}_j^{(i)}\,\bigl(\mathbf{K}_j^{(\le i)}\bigr)^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_j^{(\le i)}$$ Here, $\mathbf{q}_j^{(i)}$ is the query vector for the $i$-th token projected for head $j$, while $\mathbf{K}_j^{(\le i)}$ and $\mathbf{V}_j^{(\le i)}$ are the key and value matrices for head $j$, containing information from tokens $0$ up to $i$; $d_k$ is the key dimension used for scaling.
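The computation above can be sketched numerically. This is a minimal illustration, not a library implementation: the function name `causal_head_output` and the toy dimensions are assumptions, and the key point is that only rows 0 through i of the key and value matrices participate.

```python
import numpy as np

def causal_head_output(q_i, K, V, i):
    """Single-head attention output for token position i,
    attending only to tokens 0..i (causal restriction)."""
    d_k = K.shape[1]
    K_vis = K[: i + 1]                       # keys for tokens 0..i only
    V_vis = V[: i + 1]                       # values for tokens 0..i only
    scores = q_i @ K_vis.T / np.sqrt(d_k)    # scaled dot-product scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V_vis                   # weighted sum of visible values

# Toy example: 5 tokens, head dimension 4 (shapes are illustrative)
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out_3 = causal_head_output(Q[3], K, V, i=3)  # token 4 is never consulted
```

Because the slice stops at row i, perturbing the key or value of any later token leaves the output for position i unchanged, which is exactly the autoregressive property described above.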

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Multi-Head Attention Output Calculation
In a multi-head attention mechanism, each individual attention head computes its output using its own unique Query, Key, and Value matrices, which are distinct linear projections of the same input. What is the primary functional consequence of this design choice?
Debugging an Attention Head
Dimensionality of an Attention Head Output
You are examining the computation for a single attention head within a multi-head attention layer. Arrange the following steps in the correct chronological order to produce the output for this individual head.
Autoregressive Individual Attention Head Computation
Learn After
An autoregressive language model is in the process of generating a sequence of tokens. When a single attention head calculates its output for the 4th token in the sequence, which set of key and value vectors does it use to ensure it only relies on previously generated information?
True or False: In a causal attention mechanism, when a single attention head is calculating the output for the 4th token in a sequence, the query vector for that 4th token (q_4) will interact with the key vector from the 6th token (k_6) to compute an attention score.
Causal Attention Inputs