Space Complexity of the KV Cache
During inference, the space complexity of the Key-Value (KV) cache is directly proportional to the number of tokens for which keys and values are stored. This relationship is captured by the formula , where is the number of layers, is the number of attention heads, is the head dimension, and is the number of tokens being cached.

0
1
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Space Complexity of the KV Cache
Updating the KV Cache
Two-Phase Inference from a KV Cache Perspective
Single-Step Generation with a KV Cache
Memory Allocation for KV Caching in Standard Self-Attention
Multi-Dimensional Structure of the KV Cache
An autoregressive language model generates text one word at a time. To generate the 100th word, it must relate it to all 99 previous words. A common optimization involves storing in memory the intermediate representations for each of the first 99 words as they are generated.
Which statement best analyzes the primary computational advantage of this optimization compared to re-computing everything from scratch at step 100?
Chatbot Performance Degradation
Computational Steps in Cached Inference
Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack
Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure
Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths
Stabilizing latency and GPU memory in a chat-completions service with shared system prompts
Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service
Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
You run an internal LLM inference service for empl...
Your company’s internal LLM service handles many c...
You operate a GPU-backed LLM service that uses con...
You’re on-call for an internal LLM chat service. M...