Learn Before
KV Cache Size in Multi-Query Attention
In Multi-Query Attention (MQA), a single set of keys and values is shared across all attention heads rather than each head maintaining its own. Because of this sharing, the memory footprint of the Key-Value (KV) cache is significantly reduced compared to standard multi-head attention. For a sequence of length $n$, the per-layer KV cache in MQA scales as $O(n \cdot d_{\text{head}})$, rather than the $O(n \cdot h \cdot d_{\text{head}})$ of standard multi-head attention with $h$ heads, reflecting the removal of the head-count multiplier.
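A minimal sketch of this size comparison, assuming illustrative parameter values (sequence length, layer count, head count, head dimension, and fp16 storage are assumptions, not taken from the card):

```python
# Hypothetical example: compare per-sequence KV cache sizes for standard
# multi-head attention (MHA) vs. Multi-Query Attention (MQA).

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values each occupy seq_len * num_kv_heads * head_dim elements
    # per layer, hence the leading factor of 2.
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Assumed example configuration (illustrative only).
seq_len, num_layers, num_heads, head_dim = 4096, 32, 32, 128

mha = kv_cache_bytes(seq_len, num_layers, num_kv_heads=num_heads, head_dim=head_dim)
mqa = kv_cache_bytes(seq_len, num_layers, num_kv_heads=1, head_dim=head_dim)

print(f"MHA KV cache: {mha / 2**30:.2f} GiB")  # scales with the head count h
print(f"MQA KV cache: {mqa / 2**30:.2f} GiB")  # head-count multiplier removed
print(f"Reduction factor: {mha / mqa:.0f}x")   # equals num_heads
```

With these assumed values the cache shrinks from about 2 GiB to 64 MiB, a factor equal to the number of heads.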
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Individual Attention Head Formula in Multi-Query Attention (MQA)
Attention Mechanism Efficiency Analysis
In an effort to optimize an attention-based model, a researcher modifies the standard multi-head attention mechanism. The new design shares a single Key (K) and Value (V) projection across all attention heads, while each head continues to use its own unique Query (Q) projection. Which statement best analyzes the primary trade-off of this architectural change?
Structural Comparison of Attention Mechanisms
You’re leading an LLM platform team that must supp...
You’re debugging an LLM inference service that mus...
Your team is deploying a chat-based LLM that must ...
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
You’re reviewing a design doc for a Transformer at...
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets