Attention Head Output with Grouped Queries and Causal Masking
This formula gives the output of a single attention head, $\text{head}_j$, in a transformer that uses Grouped-Query Attention (GQA) with causal masking: $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$. Here $\text{Att}_{\text{qkv}}$ is the attention function, and $\mathbf{q}_i^{[j]}$ is the query vector for the current position $i$ and head $j$. The keys $\mathbf{K}$ and values $\mathbf{V}$ are shared among a group of query heads, with the head-to-group mapping given by $g(j)$. The subscript $\le i$ indicates that the attention is causal: only tokens up to the current position $i$ are considered.
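The formula above can be sketched in code. This is a minimal NumPy illustration, not a production implementation: the function name `gqa_head_output` and the contiguous head-to-group assignment `g(j) = j // (num_heads // num_groups)` are assumptions for the example, and causal masking is enforced implicitly by only passing the cached keys and values for positions $\le i$.

```python
import numpy as np

def gqa_head_output(q_i, K_cache, V_cache, j, num_groups):
    """Output of query head j at position i under GQA with causal masking.

    q_i:     (num_heads, d_k)        query vectors for the current position i
    K_cache: (num_groups, i+1, d_k)  cached keys K_{<=i}, one set per group
    V_cache: (num_groups, i+1, d_v)  cached values V_{<=i}, one set per group
    """
    num_heads, d_k = q_i.shape
    g = j // (num_heads // num_groups)   # g(j): map head j to its group (assumed contiguous grouping)
    K = K_cache[g]                       # K_{<=i}^{[g(j)]}, shared by all heads in group g
    V = V_cache[g]                       # V_{<=i}^{[g(j)]}
    # Scaled dot-product attention. Causality holds because the cache
    # contains only positions <= i, so no future token is visible.
    scores = K @ q_i[j] / np.sqrt(d_k)   # (i+1,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over positions <= i
    return weights @ V                   # head_j, shape (d_v,)
```

Note that the per-group cache shape `(num_groups, i+1, d_k)` is what makes GQA's KV cache smaller than standard multi-head attention, which would need one cache entry per head.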

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Attention Head Output with Grouped Queries and Causal Masking
Attention Head Output in Grouped-Query Attention (GQA)
A computational model processes sequences and, at a specific step $i$, maintains a collection of data represented as $\{(\mathbf{K}_{\le i}^{[t]}, \mathbf{V}_{\le i}^{[t]})\}_{t=1}^{\tau}$. In this set, each element is a pair of matrices, the subscript $\le i$ indicates that the matrices contain information for all sequence positions from the start up to position $i$, and the superscript $[t]$ is an index ranging from 1 to $\tau$. Based on this structure, which statement provides the most accurate analysis of the collection?
Interpreting a Set of Indexed Key-Value Pairs
State of Key-Value Cache During Generation
GQA as an Interpolation Between MHA and MQA
An engineering team is designing a large language model for a real-time translation application on a smartphone. The key constraints are low latency (fast response time) and a small memory footprint. However, maintaining high translation quality is also crucial. The team is debating the architecture of the model's attention layers. Which of the following approaches represents the most effective trade-off for this specific use case?
An attention layer in a transformer model is configured with 32 query heads. These query heads are organized into 8 distinct groups, where all heads within a single group share the same key and value projections. Based on this configuration, how many unique key/value projection pairs are used in this layer?
An architect is designing a new transformer model and is considering different configurations for the attention mechanism. Match each Grouped-Query Attention (GQA) configuration to the specific attention behavior it produces.
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets
Sets of Keys and Values in Grouped-Query Attention (GQA)
KV Cache Size in Grouped-Query Attention (GQA)
Learn After
A transformer model calculates the output for a single attention head $j$ at token position $i$ using the formula $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$, where $g(j)$ maps the query head $j$ to a specific group. What is the primary consequence of using the group-indexed key $\mathbf{K}_{\le i}^{[g(j)]}$ instead of a head-specific key $\mathbf{K}_{\le i}^{[j]}$?
Applying Grouped-Query Attention with Causal Masking
Deconstructing the GQA Formula with Causal Masking