Learn Before
Comparison of Prefilling and Decoding Phases
The prefilling and decoding phases of Large Language Model inference differ significantly across several dimensions. While prefilling aims to establish the initial context from the input sequence, decoding generates the subsequent output tokens. In prefilling, all tokens are visible at once and processed in parallel to build an encoded contextual representation. In contrast, decoding operates with sequential visibility, predicting one token at a time using the previously cached key-value pairs. Consequently, prefilling is typically compute-bound, with a high one-time computational cost, whereas decoding is memory-bound: each step performs comparatively little computation but must read the entire, ever-growing key-value cache, so memory access becomes the dominant cost as the sequence grows.
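To make the contrast concrete, here is a minimal sketch of the two phases with a single attention head. The NumPy implementation, toy dimensions, random weights, and prompt length are illustrative assumptions, not material from the course: prefilling runs attention over the whole prompt in one parallel pass while filling the KV cache, and each decoding step issues a single query that reads the entire, growing cache.

```python
import numpy as np

# Toy single-head attention; dimensions and weights are assumptions for illustration.
d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attention(Q, K, V, causal=False):
    scores = (Q @ K.T) / np.sqrt(d)
    if causal:
        # During prefilling all prompt tokens are processed at once, but each
        # position may only attend to itself and earlier positions.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# --- Prefilling: the whole prompt is visible and processed in parallel ---
prompt = rng.standard_normal((16, d))            # stand-in embeddings for a 16-token prompt
K_cache, V_cache = prompt @ W_k, prompt @ W_v    # KV cache built in a single pass
context = attention(prompt @ W_q, K_cache, V_cache, causal=True)

# --- Decoding: one token per step, attending over the growing KV cache ---
token = context[-1:]                             # start from the last prefilled position
for step in range(4):
    q = token @ W_q                              # a single new query per step
    K_cache = np.vstack([K_cache, token @ W_k])  # cache grows by one row each step
    V_cache = np.vstack([V_cache, token @ W_v])
    out = attention(q, K_cache, V_cache)         # little compute, but reads the whole cache
    token = out                                  # stand-in for the next token's embedding
    print(f"step {step}: cached key-value pairs = {K_cache.shape[0]}")
```

In this sketch each decoding step multiplies only one query against the cache, so the arithmetic per step stays small while the data read from the cache grows with the sequence length, which is why decoding tends to be limited by memory bandwidth rather than compute.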
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Diagram of the Decoding Phase
Single-Step Generation with a KV Cache
Comparison of Prefilling and Decoding Phases
Disaggregation of Prefilling and Decoding using Pipelined Engines
After a large language model processes an initial prompt, it enters a generation stage where it produces the output sequence one token at a time. In each step of this stage, a new query vector is generated for the current position, and the model must perform an attention operation over the key-value pairs of the initial prompt plus the key-value pairs of all tokens generated in previous steps. As the output sequence gets longer, what becomes the most significant performance bottleneck for generating each new token?
A large language model has finished processing an initial prompt and is about to generate the first token of its response. Arrange the following events in the correct chronological order for this single generation step.
Evaluating an Inference Optimization Proposal
You run an internal LLM inference service for empl...
You’re on-call for an internal LLM chat service. M...
You operate a GPU-backed LLM service that uses con...
Your company’s internal LLM service handles many c...
Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths
Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure
Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack
Stabilizing latency and GPU memory in a chat-completions service with shared system prompts
Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service
Decoding Phase Goal Formula
Learn After
Analyzing Language Model Inference Performance
A user provides a large 2,000-token text to a generative language model and asks for a summary. Which statement best describes how the model initially handles this 2,000-token input before it starts generating the summary?
Match each phase of the language model inference process with its primary computational characteristic.