A transformer model calculates the output for a single attention head j at token position i using the formula: $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$, where g(j) maps the query head j to a specific group. What is the primary consequence of using the group-indexed key $\mathbf{K}_{\le i}^{[g(j)]}$ instead of a head-specific key $\mathbf{K}_{\le i}^{[j]}$?
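To make the formula concrete, here is a minimal NumPy sketch of one grouped-query attention head with causal masking. All names, tensor sizes, and the grouping rule g(j) = j // (num_heads // num_groups) are illustrative assumptions, not from the source. The key point the sketch shows: because K and V are indexed by g(j) rather than j, only num_groups distinct key/value tensors exist instead of num_heads, so query heads within a group share the same keys and values (and hence the same KV cache entries).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_head(q_i, K_group, V_group):
    """Att_qkv for one head: attention of query q_i over the group's
    keys/values for positions <= i. Causality is enforced by passing
    in only the rows 0..i of K and V."""
    d = q_i.shape[-1]
    scores = K_group @ q_i / np.sqrt(d)   # shape (i+1,)
    weights = softmax(scores)             # attention over positions <= i
    return weights @ V_group              # shape (d_head,)

# Toy configuration (illustrative, not from the source):
num_heads, num_groups, d_head, seq_len = 8, 2, 4, 6
heads_per_group = num_heads // num_groups

def g(j):
    """Assumed grouping rule: map query head j to its key/value group."""
    return j // heads_per_group

rng = np.random.default_rng(0)
# Queries are per HEAD; keys/values are per GROUP (the point of GQA):
Q = rng.normal(size=(num_heads, seq_len, d_head))
K = rng.normal(size=(num_groups, seq_len, d_head))
V = rng.normal(size=(num_groups, seq_len, d_head))

i, j = 4, 5  # token position i, query head j
out = gqa_head(Q[j, i], K[g(j), :i + 1], V[g(j), :i + 1])
print(out.shape)  # (4,)
```

Note how heads j = 4..7 all read K[1] and V[1] in this sketch: during autoregressive decoding, a model only has to cache num_groups key/value sequences per layer, which is the usual motivation for the group-indexed form.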
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Applying Grouped-Query Attention with Causal Masking
Deconstructing the GQA Formula with Causal Masking