Learn Before
KV Cache Size in Multi-Query Attention
In Multi-Query Attention (MQA), a single set of keys and values is shared across all attention heads rather than each head maintaining its own. Because of this sharing, the memory footprint of the Key-Value (KV) cache is significantly reduced compared to standard multi-head attention. For a sequence of length $n$, the per-layer KV cache in MQA scales as $O(n \cdot d_{\text{head}})$, rather than the $O(n \cdot h \cdot d_{\text{head}})$ of standard multi-head attention with $h$ heads, reflecting the removal of the head-count multiplier.
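A minimal sketch of this size comparison, assuming illustrative parameter values (sequence length, layer count, head count, head dimension, and fp16 storage are assumptions, not taken from the card):

```python
# Hypothetical example: compare per-sequence KV cache sizes for standard
# multi-head attention (MHA) vs. Multi-Query Attention (MQA).

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values each occupy seq_len * num_kv_heads * head_dim elements
    # per layer, hence the leading factor of 2.
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Assumed example configuration (illustrative only).
seq_len, num_layers, num_heads, head_dim = 4096, 32, 32, 128

mha = kv_cache_bytes(seq_len, num_layers, num_kv_heads=num_heads, head_dim=head_dim)
mqa = kv_cache_bytes(seq_len, num_layers, num_kv_heads=1, head_dim=head_dim)

print(f"MHA KV cache: {mha / 2**30:.2f} GiB")  # scales with the head count h
print(f"MQA KV cache: {mqa / 2**30:.2f} GiB")  # head-count multiplier removed
print(f"Reduction factor: {mha / mqa:.0f}x")   # equals num_heads
```

With these assumed values the cache shrinks from about 2 GiB to 64 MiB, a factor equal to the number of heads.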
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Individual Attention Head Formula in Multi-Query Attention (MQA)
Attention Mechanism Efficiency Analysis
In an effort to optimize an attention-based model, a researcher modifies the standard multi-head attention mechanism. The new design shares a single Key (K) and Value (V) projection across all attention heads, while each head continues to use its own unique Query (Q) projection. Which statement best analyzes the primary trade-off of this architectural change?
Structural Comparison of Attention Mechanisms
You’re leading an LLM platform team that must supp...
You’re debugging an LLM inference service that mus...
Your team is deploying a chat-based LLM that must ...
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
You’re reviewing a design doc for a Transformer at...
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets