Learn Before
Decoding Phase in Transformer Inference
Following the prefilling stage, the decoding phase uses the key-value pairs precomputed and stored in the KV cache to generate output tokens autoregressively, one token at a time: each step computes attention only for the newly generated token and appends that token's key and value to the cache, rather than recomputing keys and values for the entire sequence.
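The mechanics above can be sketched as follows. This is a minimal, hypothetical single-head example in NumPy (real models use multi-head attention, learned projections, and stacked layers); the names `decode_step` and `kv_cache` are illustrative, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative)

# Prefilling has already populated the cache with one K row and one V row
# per prompt token (5 prompt tokens here); decoding only appends to it.
kv_cache = {"K": rng.normal(size=(5, d)), "V": rng.normal(size=(5, d))}

def decode_step(q_new, k_new, v_new, cache):
    """One autoregressive step: append the new token's key/value to the
    cache, then attend over the full cache. Past tokens' K/V are reused,
    never recomputed."""
    cache["K"] = np.vstack([cache["K"], k_new])
    cache["V"] = np.vstack([cache["V"], v_new])
    scores = cache["K"] @ q_new / np.sqrt(d)   # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the sequence
    return weights @ cache["V"]                # attention output, shape (d,)

# Generate three tokens; the cache grows by exactly one entry per step.
for _ in range(3):
    out = decode_step(rng.normal(size=d), rng.normal(size=d),
                      rng.normal(size=d), kv_cache)

print(kv_cache["K"].shape)  # 5 prompt rows + 3 generated rows
```

Note that each step's attention reads the entire cache, so per-token work grows with sequence length even though only one new key/value pair is computed.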
References
Reference of Foundations of Large Language Models Course
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Prefilling Phase in Transformer Inference
Computational Cost Comparison: Decoding vs. Prefilling
Decoding Phase in Transformer Inference
Analysis of KV Cache Utilization in Autoregressive Generation
In an autoregressive Transformer model, generating a sequence in response to an input prompt involves two distinct phases from the perspective of the Key-Value (KV) cache. Which option correctly distinguishes the computational activities of these two phases?
An autoregressive language model receives an input prompt and generates a response. From the perspective of how it uses its internal memory for past context (the Key-Value cache), arrange the following high-level stages of the generation process in the correct chronological order.
Learn After
Diagram of the Decoding Phase
Single-Step Generation with a KV Cache
Comparison of Prefilling and Decoding Phases
Disaggregation of Prefilling and Decoding using Pipelined Engines
After a large language model processes an initial prompt, it enters a generation stage where it produces the output sequence one token at a time. In each step of this stage, a new query vector is generated for the current position, and it must perform an attention operation over the key-value pairs of the initial prompt plus all the key-value pairs of the tokens generated in previous steps. As the output sequence gets longer, what becomes the most significant performance bottleneck for generating each new token?
A large language model has finished processing an initial prompt and is about to generate the first token of its response. Arrange the following events in the correct chronological order for this single generation step.
Evaluating an Inference Optimization Proposal
You run an internal LLM inference service for empl...
You’re on-call for an internal LLM chat service. M...
You operate a GPU-backed LLM service that uses con...
Your company’s internal LLM service handles many c...
Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths
Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure
Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack
Stabilizing latency and GPU memory in a chat-completions service with shared system prompts
Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service
Decoding Phase Goal Formula