Short Answer

Deconstructing the GQA Formula with Causal Masking

The formula $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$ describes the output of a single query head $j$ at position $i$ in grouped-query attention (GQA). Analyze the distinct roles of the $\le i$ subscript and the $g(j)$ mapping function. How does each of these two components influence the model's capabilities and computational performance, respectively?
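
For concreteness, here is a minimal NumPy sketch of one head's computation. It is an illustrative reading of the formula, not a reference implementation: the head counts, shapes, and the block mapping $g(j) = \lfloor j / (n_q / n_{kv}) \rfloor$ are assumptions. The $\le i$ subscript becomes a slice `[: i + 1]` over the key/value sequence, and $g(j)$ becomes an index selecting which shared key/value head the query head reads from.

```python
import numpy as np

def gqa_head(j, i, Q, K, V):
    """head_j = Att_qkv(q_i^[j], K_{<=i}^[g(j)], V_{<=i}^[g(j)]).

    Q:    (n_q_heads, seq_len, d)  -- one query vector per head and position
    K, V: (n_kv_heads, seq_len, d) -- shared key/value heads, n_kv_heads <= n_q_heads
    Shapes and the block mapping g(j) below are illustrative assumptions.
    """
    n_q_heads, _, d = Q.shape
    n_kv_heads = K.shape[0]
    g = j // (n_q_heads // n_kv_heads)  # g(j): assign query heads to KV groups in blocks

    q = Q[j, i]              # q_i^[j]: the query at position i
    K_ctx = K[g, : i + 1]    # K_{<=i}^[g(j)]: keys up to and including position i (causal mask)
    V_ctx = V[g, : i + 1]    # V_{<=i}^[g(j)]: the matching causal slice of values

    scores = K_ctx @ q / np.sqrt(d)   # scaled dot products over positions 0..i only
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # softmax over the visible prefix
    return w @ V_ctx                  # weighted sum of values -> head_j
```

With hypothetical sizes `n_q_heads = 8` and `n_kv_heads = 2`, query heads 0–3 all read the same `K[0], V[0]`, so a KV cache would hold 2 rather than 8 key/value streams, while the `[: i + 1]` slice ensures position $i$ never attends to future tokens.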


Updated 2025-10-08

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy
