Attention Head Output with Grouped Queries and Causal Masking
This formula gives the output of a single attention head, $\text{head}_j$, in a transformer that uses Grouped-Query Attention (GQA) with causal masking: $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$. Here $\text{Att}_{\text{qkv}}$ is the attention function, and $\mathbf{q}_i^{[j]}$ is the query vector for the current position $i$ and head $j$. The keys $\mathbf{K}$ and values $\mathbf{V}$ are shared among a group of query heads, with the head-to-group mapping given by $g(j)$. The subscript $\le i$ indicates that the attention is causal: only tokens up to the current position $i$ are considered.
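The formula above can be sketched in code. This is a minimal NumPy illustration, not a production implementation: the function name `gqa_head_output` and the contiguous head-to-group assignment `g(j) = j // (num_heads // num_groups)` are assumptions for the example, and causal masking is enforced implicitly by only passing the cached keys and values for positions $\le i$.

```python
import numpy as np

def gqa_head_output(q_i, K_cache, V_cache, j, num_groups):
    """Output of query head j at position i under GQA with causal masking.

    q_i:     (num_heads, d_k)        query vectors for the current position i
    K_cache: (num_groups, i+1, d_k)  cached keys K_{<=i}, one set per group
    V_cache: (num_groups, i+1, d_v)  cached values V_{<=i}, one set per group
    """
    num_heads, d_k = q_i.shape
    g = j // (num_heads // num_groups)   # g(j): map head j to its group (assumed contiguous grouping)
    K = K_cache[g]                       # K_{<=i}^{[g(j)]}, shared by all heads in group g
    V = V_cache[g]                       # V_{<=i}^{[g(j)]}
    # Scaled dot-product attention. Causality holds because the cache
    # contains only positions <= i, so no future token is visible.
    scores = K @ q_i[j] / np.sqrt(d_k)   # (i+1,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over positions <= i
    return weights @ V                   # head_j, shape (d_v,)
```

Note that the per-group cache shape `(num_groups, i+1, d_k)` is what makes GQA's KV cache smaller than standard multi-head attention, which would need one cache entry per head.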

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Attention Head Output with Grouped Queries and Causal Masking
Attention Head Output in Grouped-Query Attention (GQA)
A computational model processes sequences and, at a specific step $i$, maintains a collection of data represented as $\{(\mathbf{K}_{\le i}^{[t]}, \mathbf{V}_{\le i}^{[t]})\}_{t=1}^{\tau}$. In this set, each element is a pair of matrices, the subscript $\le i$ indicates that the matrices contain information for all sequence positions from the start up to position $i$, and the superscript $[t]$ is an index ranging from 1 to $\tau$. Based on this structure, which statement provides the most accurate analysis of the collection?
Interpreting a Set of Indexed Key-Value Pairs
State of Key-Value Cache During Generation
GQA as an Interpolation Between MHA and MQA
An engineering team is designing a large language model for a real-time translation application on a smartphone. The key constraints are low latency (fast response time) and a small memory footprint. However, maintaining high translation quality is also crucial. The team is debating the architecture of the model's attention layers. Which of the following approaches represents the most effective trade-off for this specific use case?
An attention layer in a transformer model is configured with 32 query heads. These query heads are organized into 8 distinct groups, where all heads within a single group share the same key and value projections. Based on this configuration, how many unique key/value projection pairs are used in this layer?
An architect is designing a new transformer model and is considering different configurations for the attention mechanism. Match each Grouped-Query Attention (GQA) configuration to the specific attention behavior it produces.
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets
Sets of Keys and Values in Grouped-Query Attention (GQA)
KV Cache Size in Grouped-Query Attention (GQA)
Learn After
A transformer model calculates the output for a single attention head $j$ at token position $i$ using the formula $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$, where $g(j)$ maps the query head $j$ to a specific group. What is the primary consequence of using the group-indexed key $\mathbf{K}_{\le i}^{[g(j)]}$ instead of a head-specific key $\mathbf{K}_{\le i}^{[j]}$?
Applying Grouped-Query Attention with Causal Masking
Deconstructing the GQA Formula with Causal Masking