Concept
Intuition Behind Overlapping Prefilling and Decoding
The intuition behind overlapping prefilling and decoding in continuous batching is that the two phases stress different hardware resources, so running them together reduces idle time for both computation and data transfer. Prefilling is generally compute-bound, while decoding is generally memory-bound. A system can therefore process a prefilling mini-batch to keep the GPU's compute units occupied while concurrently processing a decoding mini-batch, whose memory transfers would otherwise leave those compute units waiting.
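The benefit can be illustrated with a toy cost model. The sketch below is not a real scheduler. It assumes each mini-batch is described by two hypothetical numbers, the time its arithmetic would take and the time its memory traffic would take, and that overlap is ideal, with compute and memory transfers proceeding fully in parallel. The names `MiniBatch`, `sequential_time`, and `overlapped_time`, and all the millisecond values, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MiniBatch:
    compute_ms: float  # assumed time the compute units need for this batch
    memory_ms: float   # assumed time the memory system needs for this batch

    def time_alone(self) -> float:
        # Run in isolation, a batch is limited by its slower resource;
        # the other resource sits partly idle.
        return max(self.compute_ms, self.memory_ms)


def sequential_time(a: MiniBatch, b: MiniBatch) -> float:
    # Run one batch after the other: each pays for its own bottleneck.
    return a.time_alone() + b.time_alone()


def overlapped_time(a: MiniBatch, b: MiniBatch) -> float:
    # Idealized overlap: each resource serves both batches back to back,
    # and the step ends when the busier resource finishes.
    return max(a.compute_ms + b.compute_ms, a.memory_ms + b.memory_ms)


# Illustrative numbers only: prefilling is compute-heavy, decoding is memory-heavy.
prefill = MiniBatch(compute_ms=10.0, memory_ms=2.0)
decode = MiniBatch(compute_ms=1.0, memory_ms=8.0)

print(sequential_time(prefill, decode))  # 18.0
print(overlapped_time(prefill, decode))  # 11.0
```

Under these assumed numbers the overlapped step finishes in 11 ms instead of 18 ms, because the decoding batch's memory traffic fills time in which the memory system would otherwise be idle during prefilling. Real systems fall short of this ideal, since the two batches also contend for shared resources.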
Updated 2026-05-06
Tags
Foundations of Large Language Models
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences