Attention Head Output in Grouped-Query Attention (GQA)
The output computation for a specific attention head in a Grouped-Query Attention (GQA) model depends on its assigned key-value group. If g(j) represents the group ID for the j-th head, the head's output is calculated using the formula:

head_j = Att_qkv(q_i^[j], K_<=i^[g(j)], V_<=i^[g(j)])

In this expression, the unique query vector q_i^[j] for the current token attends to the keys K_<=i^[g(j)] and values V_<=i^[g(j)] that are shared within its respective group g(j).
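A minimal NumPy sketch of this computation, assuming illustrative names (att_qkv, gqa_head_output, group_of) that are not part of the original card:

```python
import numpy as np

def att_qkv(q, K, V):
    """Scaled dot-product attention for one query vector q against the
    cached keys K and values V covering positions <= i."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)            # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over positions
    return weights @ V                     # weighted sum of value rows

def gqa_head_output(j, q_heads, K_groups, V_groups, group_of):
    """head_j = Att_qkv(q_i^[j], K_<=i^[g(j)], V_<=i^[g(j)])."""
    g = group_of(j)                        # group ID g(j) for head j
    return att_qkv(q_heads[j], K_groups[g], V_groups[g])

# Example: 4 query heads sharing 2 KV groups (heads 1-2 -> group 1, 3-4 -> group 2).
rng = np.random.default_rng(0)
q_heads = {j: rng.standard_normal(64) for j in range(1, 5)}
K = {g: rng.standard_normal((10, 64)) for g in (1, 2)}
V = {g: rng.standard_normal((10, 64)) for g in (1, 2)}
out = gqa_head_output(3, q_heads, K, V, group_of=lambda j: (j - 1) // 2 + 1)
```

Every head keeps its own query q_i^[j], but heads mapped to the same group g(j) read from the same cached K and V matrices.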

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Attention Head Output with Grouped Queries and Causal Masking
Attention Head Output in Grouped-Query Attention (GQA)
A computational model processes sequences and, at a specific step i, maintains a collection of data represented as {(K_<=i^[t], V_<=i^[t])} for t = 1, ..., τ. In this set, each element (K_<=i^[t], V_<=i^[t]) is a pair of matrices, the subscript <=i indicates that the matrices contain information for all sequence positions from the start up to position i, and the superscript [t] is an index ranging from 1 to τ. Based on this structure, which statement provides the most accurate analysis of the collection?
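A small sketch of one plausible reading of this collection as a per-index key-value cache at step i (the names kv_cache, tau, and the sizes are illustrative assumptions):

```python
import numpy as np

# At step i, keep for every index t = 1..τ the cached key and value
# matrices covering positions 1..i.
tau, i, d_k, d_v = 4, 10, 64, 64
kv_cache = {
    t: (np.zeros((i, d_k)),   # K_<=i^[t]: keys for positions 1..i
        np.zeros((i, d_v)))   # V_<=i^[t]: values for positions 1..i
    for t in range(1, tau + 1)
}
# Generating the next token appends one row to each pair; τ stays fixed.
```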
Interpreting a Set of Indexed Key-Value Pairs
State of Key-Value Cache During Generation
GQA as an Interpolation Between MHA and MQA
An engineering team is designing a large language model for a real-time translation application on a smartphone. The key constraints are low latency (fast response time) and a small memory footprint. However, maintaining high translation quality is also crucial. The team is debating the architecture of the model's attention layers. Which of the following approaches represents the most effective trade-off for this specific use case?
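The memory side of this trade-off can be made concrete with a rough KV-cache estimate (all model sizes below are illustrative assumptions, not from the question):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Rough per-sequence KV-cache size; the factor 2 covers both K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative small model: 24 layers, head_dim 64, 4k context, fp16 cache.
for name, kv_heads in [("MHA (16 KV heads)", 16), ("GQA (4 KV groups)", 4), ("MQA (1 KV head)", 1)]:
    mib = kv_cache_bytes(layers=24, kv_heads=kv_heads, head_dim=64, seq_len=4096) / 2**20
    print(f"{name}: {mib:.0f} MiB per sequence")   # 384, 96, 24 MiB
```

Fewer KV heads shrink the cache and speed up decoding, at some cost in quality relative to full multi-head attention.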
An attention layer in a transformer model is configured with 32 query heads. These query heads are organized into 8 distinct groups, where all heads within a single group share the same key and value projections. Based on this configuration, how many unique key/value projection pairs are used in this layer?
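A quick sanity check of the counting involved, using the numbers from the question (a sketch):

```python
query_heads = 32
kv_groups = 8
heads_per_group = query_heads // kv_groups   # 32 / 8 = 4 query heads share each group
unique_kv_pairs = kv_groups                  # one key/value projection pair per group
print(heads_per_group, unique_kv_pairs)      # 4 8
```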
An architect is designing a new transformer model and is considering different configurations for the attention mechanism. Match each Grouped-Query Attention (GQA) configuration to the specific attention behavior it produces.
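The limiting cases behind this matching can be sketched as a small classifier (the function name gqa_behavior is illustrative):

```python
def gqa_behavior(num_query_heads: int, num_kv_groups: int) -> str:
    """Classify a GQA configuration by how keys/values are shared."""
    if num_kv_groups == num_query_heads:
        return "multi-head attention: every head has its own K/V"
    if num_kv_groups == 1:
        return "multi-query attention: all heads share one K/V"
    return "grouped-query attention: K/V shared within each group"

print(gqa_behavior(32, 32))  # MHA limit
print(gqa_behavior(32, 1))   # MQA limit
print(gqa_behavior(32, 8))   # intermediate GQA
```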
You’re leading an LLM platform team that must supp...
You’re debugging an LLM inference service that mus...
Your team is deploying a chat-based LLM that must ...
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
You’re reviewing a design doc for a Transformer at...
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets
Sets of Keys and Values in Grouped-Query Attention (GQA)
KV Cache Size in Grouped-Query Attention (GQA)
Learn After
In a specific attention mechanism, there are 8 query heads (indexed j=1 to 8) and 2 distinct Key-Value (KV) groups (indexed g=1 to 2). Query heads 1 through 4 are assigned to KV group 1, while query heads 5 through 8 are assigned to KV group 2. The output for a given query head j is calculated based on its own query vector q^[j] and the Key-Value pair from its assigned group, (K^[g(j)], V^[g(j)]). Which Key-Value pair will query head 6 use for its computation?

In a grouped-query attention system with 12 query heads (indexed j=1 to 12), the function g(j) maps a query head j to its corresponding key-value group. This mapping is defined by the formula g(j) = floor((j-1) / 3) + 1. Based on this, which of the following pairs of query heads will use the same set of Key and Value matrices for their attention computation?

Consider an attention mechanism where the output for a head j is computed by the formula head_j = Att_qkv(q_i^[j], K_<=i^[g(j)], V_<=i^[g(j)]). In this setup, q_i^[j] is a query vector unique to head j, while the function g(j) maps head j to a potentially shared key-value group. Statement: If two distinct query heads, j1 and j2, are mapped to the same key-value group (meaning g(j1) = g(j2)), their final output vectors, head_j1 and head_j2, will necessarily be identical.
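A small sketch of the head-to-group mappings used in these questions (group_of is an illustrative name):

```python
def group_of(j: int, heads_per_group: int) -> int:
    """g(j) = floor((j-1) / heads_per_group) + 1, with heads indexed from 1."""
    return (j - 1) // heads_per_group + 1

# 8 query heads, 2 KV groups (4 heads per group): head 6 falls in group 2.
print(group_of(6, heads_per_group=4))                       # 2

# 12 query heads with g(j) = floor((j-1)/3) + 1.
print({j: group_of(j, heads_per_group=3) for j in range(1, 13)})
# heads 1-3 -> group 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
```

Note that even when g(j1) = g(j2), each head still applies its own query vector q_i^[j], so sharing a KV group does not in general make the head outputs identical.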