Learn Before
KV Caching for Reducing Redundant Computation
The primary function of the KV cache in Transformer inference is to improve computational efficiency. By storing the key and value vectors of previously processed tokens, the model avoids recomputing them at every subsequent generation step: each new step computes the query, key, and value for the newest token only, then lets that query attend over the cached keys and values. This cuts the per-token attention cost from quadratic to linear in the sequence length, at the price of extra memory.
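A minimal NumPy sketch of the idea, assuming a single attention head; the projection matrices W_q, W_k, W_v and the function name are illustrative, not from the course:

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, cache):
    """One cached decoding step for a single attention head.

    x_new: (d_model,) embedding of the newest token.
    cache: dict holding 'K' and 'V' arrays of shape (t, d_head); may be empty.
    """
    # Project only the newest token; the prefix's keys/values are cached.
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v

    # Extend the cache instead of recomputing K and V for the whole prefix.
    cache["K"] = np.vstack([cache["K"], k]) if "K" in cache else k[None, :]
    cache["V"] = np.vstack([cache["V"], v]) if "V" in cache else v[None, :]

    # The new query attends over all cached keys: O(t) work per step,
    # instead of rebuilding the full t x t attention matrix.
    scores = cache["K"] @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["V"]
```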
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
KV Caching for Reducing Redundant Computation
Memory-Compute-Accuracy Triangle in LLM Optimization
Low-Precision Implementation of Transformers
LLM Deployment Strategy Analysis
An engineering team is deploying a large language model for a real-time chatbot application on a device with limited processing power but ample available memory. They are considering two approaches for generating responses:
- Approach A: For each new word generated, the model re-processes the entire conversation history from scratch.
- Approach B: The model stores key intermediate calculations from previous words in memory and reuses them to generate the next word.
Which of the following statements best analyzes the trade-offs between these two approaches in the context of the team's hardware constraints?
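For intuition about Approaches A and B, a hedged Python sketch; `model.forward` is a hypothetical interface, not a specific library's API:

```python
def generate_no_cache(model, prompt_ids, n_new):
    # Approach A: re-run the full conversation history at every step.
    # Attention work grows quadratically per token; extra memory is minimal.
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = model.forward(ids)              # full-prefix pass each time
        ids.append(int(logits[-1].argmax()))
    return ids

def generate_with_cache(model, prompt_ids, n_new):
    # Approach B: keep per-layer keys/values in memory and feed only the
    # newest token. Attention work per token is linear in sequence length;
    # memory grows with the length of the conversation.
    logits, cache = model.forward(prompt_ids, cache=None)   # prefill
    ids = list(prompt_ids)
    for _ in range(n_new):
        next_id = int(logits[-1].argmax())
        ids.append(next_id)
        logits, cache = model.forward([next_id], cache=cache)
    return ids
```

Under the stated constraints (limited compute, ample memory), Approach B's memory-for-compute trade is the natural fit.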
Analyzing LLM Optimization Strategies
Learn After
Memory Bottleneck from KV Cache in LLMs
An auto-regressive language model is generating text and has already produced a sequence of 100 tokens. To generate the 101st token, it must calculate self-attention. If the model stores the 'key' and 'value' vectors for the first 100 tokens, which of the following best describes the computational steps required for the self-attention mechanism at this specific step?
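As an illustrative worked count (assuming a single head of dimension d; this framing is not part of the original question): with the keys and values for tokens 1..100 cached, the step requires only the new token's projections q_101, k_101, v_101, appending k_101 and v_101 to the cache, and evaluating

```latex
\mathrm{out}_{101} = \operatorname{softmax}\!\left(\frac{q_{101} K_{1:101}^{\top}}{\sqrt{d}}\right) V_{1:101}
```

that is, 101 dot products, one softmax, and one weighted sum, rather than a fresh 101 x 101 attention computation over the whole sequence.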
Optimizing Chatbot Inference Speed
Computational Cost of Autoregressive Generation