Deconstructing the GQA Formula with Causal Masking
The formula $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$ describes the output of an attention head. Analyze the distinct roles of the $\le i$ subscript and the $g(j)$ mapping function. How do these two components respectively influence the model's capabilities and computational performance?
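For concreteness, here is a minimal NumPy sketch of grouped-query attention with causal masking. All names and shapes are illustrative assumptions, as is the group mapping $g(j) = \lfloor j / \text{heads\_per\_group} \rfloor$; this is not a reference implementation. It simply makes the formula's two components visible: the causal mask implements the $\le i$ subscript, and the integer division implements $g(j)$.

```python
import numpy as np

def gqa_causal(Q, K, V, n_groups):
    """Grouped-query attention with a causal mask.

    Q: (n_heads, seq_len, d_k) -- one query projection per head.
    K, V: (n_groups, seq_len, d_k) -- one shared key/value set per group.
    Returns: (n_heads, seq_len, d_k) per-head outputs.
    """
    n_heads, seq_len, d_k = Q.shape
    heads_per_group = n_heads // n_groups
    # Causal mask: True strictly above the diagonal, i.e. the future
    # positions that a query at position i must not attend to.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    out = np.empty_like(Q)
    for j in range(n_heads):
        g = j // heads_per_group                # g(j): query head -> KV group
        scores = Q[j] @ K[g].T / np.sqrt(d_k)   # (seq_len, seq_len)
        scores[future] = -np.inf                # enforce the "<= i" restriction
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[j] = weights @ V[g]                 # values come from the same group
    return out

# Tiny smoke test: 8 query heads share only 2 stored K/V tensors.
rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 5, 16))
K = rng.standard_normal((2, 5, 16))
V = rng.standard_normal((2, 5, 16))
print(gqa_causal(Q, K, V, n_groups=2).shape)    # (8, 5, 16)
```

Note how the two components are independent: deleting the mask changes what the model can legally condition on (its capabilities), while changing `n_groups` only changes how many K/V tensors must be computed and stored (its performance).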
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Applying Grouped-Query Attention with Causal Masking
A transformer model calculates the output for a single attention head $j$ at token position $i$ using the formula $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$, where $g(j)$ maps the query head $j$ to a specific group. What is the primary consequence of using the group-indexed key $\mathbf{K}_{\le i}^{[g(j)]}$ instead of a head-specific key $\mathbf{K}_{\le i}^{[j]}$?
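For intuition on that consequence, here is a back-of-the-envelope comparison of KV-cache size under hypothetical dimensions; every constant below is an assumed example value, not drawn from any particular model.

```python
# Hypothetical KV-cache sizing; all constants are assumed example values.
n_layers, seq_len, d_k, bytes_per_elem = 32, 4096, 128, 2   # fp16 storage

def kv_cache_bytes(n_kv_sets):
    # Factor of 2: one cached tensor for keys and one for values per layer.
    return 2 * n_layers * seq_len * n_kv_sets * d_k * bytes_per_elem

mha = kv_cache_bytes(32)   # head-specific K/V: one cached set per query head
gqa = kv_cache_bytes(8)    # group-indexed K/V: one cached set per group
print(f"MHA cache: {mha / 2**30:.1f} GiB, GQA cache: {gqa / 2**30:.1f} GiB")
# -> MHA cache: 2.0 GiB, GQA cache: 0.5 GiB (4x smaller from sharing keys)
```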