Prefilling Phase in Transformer Inference
The prefilling phase is the initial stage of Transformer inference, in which the model processes the input sequence x to compute and populate the Key-Value (KV) cache. Because the entire prompt is available up front, the key and value vectors for all of its tokens can be computed in parallel in a single forward pass. The stage is called 'prefilling' because its primary function is to prepare and store the key-value vector pairs for every token in the input prompt before autoregressive decoding begins.
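To make the two phases concrete, here is a minimal NumPy sketch of a single attention head. All names, weight matrices, and dimensions below are illustrative assumptions, not taken from the course material: prefill computes K and V for every prompt token in one batched matrix multiplication and stores them as the cache; each decode_step projects only the newest token, appends one row to the cache, and attends over the full cached context.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 16  # illustrative sizes

# Hypothetical projection weights for one attention head.
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))

def prefill(prompt_states):
    """Prefilling: compute K and V for all prompt tokens at once
    and return them as the initial KV cache."""
    k_cache = prompt_states @ W_k   # (prompt_len, d_head)
    v_cache = prompt_states @ W_v   # (prompt_len, d_head)
    return k_cache, v_cache

def decode_step(token_state, k_cache, v_cache):
    """Decoding: project only the newest token, append its K/V pair
    to the cache, and attend over the whole cached context."""
    q = token_state @ W_q                          # (d_head,)
    k_cache = np.vstack([k_cache, token_state @ W_k])
    v_cache = np.vstack([v_cache, token_state @ W_v])
    scores = k_cache @ q / np.sqrt(d_head)         # (cache_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over cached keys
    attn_out = weights @ v_cache                   # (d_head,)
    return attn_out, k_cache, v_cache

# Prompt of 5 tokens (random hidden states stand in for embeddings).
prompt = rng.standard_normal((5, d_model))
k_cache, v_cache = prefill(prompt)                 # one pass over the prompt

# Generate 3 tokens; the cache grows by exactly one row per step.
token_state = rng.standard_normal(d_model)
for _ in range(3):
    out, k_cache, v_cache = decode_step(token_state, k_cache, v_cache)

print(k_cache.shape)  # (8, 16): 5 prefilled rows + 3 decoded rows
```

Note the asymmetry the sketch exposes: prefilling is one dense, token-parallel computation over the whole prompt, while each decode step processes a single token but rereads the entire cache, which is the distinction explored in the cost-comparison items listed under Related below.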

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Prefilling Phase in Transformer Inference
Computational Cost Comparison: Decoding vs. Prefilling
Decoding Phase in Transformer Inference
Analysis of KV Cache Utilization in Autoregressive Generation
In an autoregressive Transformer model, generating a sequence in response to an input prompt involves two distinct phases from the perspective of the Key-Value (KV) cache. Which option correctly distinguishes the computational activities of these two phases?
An autoregressive language model receives an input prompt and generates a response. From the perspective of how it uses its internal memory for past context (the Key-Value cache), arrange the following high-level stages of the generation process in the correct chronological order.
A user provides the following sequence of words to a large language model: 'Write a short story about a robot who discovers music.' In the model's text generation process, what is the primary role of this initial sequence of words?
Diagnosing Inference Latency
The Role of the Initial Input Sequence
Learn After
Formula for KV Cache Prefilling
Prefix Caching for LLM Inference
Prefilling as an Encoding Process
Disaggregation of Prefilling and Decoding using Pipelined Engines
Prefilling in One Go (Standard Prefilling)
A large language model is given a 1000-token document to process before it begins generating a new, multi-token response. Which statement best analyzes the fundamental computational difference between how the model processes the initial 1000-token document versus how it will subsequently generate each new token for its response?
LLM Inference Performance Analysis
Parallel Self-Attention in the Prefilling Phase
The Role and Output of the Prefilling Phase
You run an internal LLM inference service for empl...
You’re on-call for an internal LLM chat service. M...
You operate a GPU-backed LLM service that uses con...
Your company’s internal LLM service handles many c...
Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths
Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure
Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack
Stabilizing latency and GPU memory in a chat-completions service with shared system prompts
Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service
Decoding Network for KV Cache Generation