Deconstructing the GQA Formula with Causal Masking
The formula $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$ describes the output of an attention head. Analyze the distinct roles of the $\le i$ subscript and the $g(j)$ mapping function. How do these two components respectively influence the model's capabilities and computational performance?
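For concreteness, here is a minimal NumPy sketch of grouped-query attention with causal masking. All names and shapes are illustrative assumptions, as is the group mapping $g(j) = \lfloor j / \text{heads\_per\_group} \rfloor$; this is not a reference implementation. It simply makes the formula's two components visible: the causal mask implements the $\le i$ subscript, and the integer division implements $g(j)$.

```python
import numpy as np

def gqa_causal(Q, K, V, n_groups):
    """Grouped-query attention with a causal mask.

    Q: (n_heads, seq_len, d_k) -- one query projection per head.
    K, V: (n_groups, seq_len, d_k) -- one shared key/value set per group.
    Returns: (n_heads, seq_len, d_k) per-head outputs.
    """
    n_heads, seq_len, d_k = Q.shape
    heads_per_group = n_heads // n_groups
    # Causal mask: True strictly above the diagonal, i.e. the future
    # positions that a query at position i must not attend to.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    out = np.empty_like(Q)
    for j in range(n_heads):
        g = j // heads_per_group                # g(j): query head -> KV group
        scores = Q[j] @ K[g].T / np.sqrt(d_k)   # (seq_len, seq_len)
        scores[future] = -np.inf                # enforce the "<= i" restriction
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[j] = weights @ V[g]                 # values come from the same group
    return out

# Tiny smoke test: 8 query heads share only 2 stored K/V tensors.
rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 5, 16))
K = rng.standard_normal((2, 5, 16))
V = rng.standard_normal((2, 5, 16))
print(gqa_causal(Q, K, V, n_groups=2).shape)    # (8, 5, 16)
```

Note how the two components are independent: deleting the mask changes what the model can legally condition on (its capabilities), while changing `n_groups` only changes how many K/V tensors must be computed and stored (its performance).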
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Applying Grouped-Query Attention with Causal Masking
A transformer model calculates the output for a single attention head $j$ at token position $i$ using the formula $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$, where $g(j)$ maps the query head $j$ to a specific group. What is the primary consequence of using the group-indexed key $\mathbf{K}_{\le i}^{[g(j)]}$ instead of a head-specific key $\mathbf{K}_{\le i}^{[j]}$?
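For intuition on that consequence, here is a back-of-the-envelope comparison of KV-cache size under hypothetical dimensions; every constant below is an assumed example value, not drawn from any particular model.

```python
# Hypothetical KV-cache sizing; all constants are assumed example values.
n_layers, seq_len, d_k, bytes_per_elem = 32, 4096, 128, 2   # fp16 storage

def kv_cache_bytes(n_kv_sets):
    # Factor of 2: one cached tensor for keys and one for values per layer.
    return 2 * n_layers * seq_len * n_kv_sets * d_k * bytes_per_elem

mha = kv_cache_bytes(32)   # head-specific K/V: one cached set per query head
gqa = kv_cache_bytes(8)    # group-indexed K/V: one cached set per group
print(f"MHA cache: {mha / 2**30:.1f} GiB, GQA cache: {gqa / 2**30:.1f} GiB")
# -> MHA cache: 2.0 GiB, GQA cache: 0.5 GiB (4x smaller from sharing keys)
```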