Concept

Intuition Behind Overlapping Prefilling and Decoding

The intuition behind overlapping prefilling and decoding in continuous batching is to reduce idle time in both computation and memory transfer by exploiting the different hardware bottlenecks of the two phases. Prefilling is generally compute-bound, while decoding is generally memory-bound. A system can therefore run a prefilling mini-batch that keeps the GPU's compute units busy while a decoding mini-batch runs alongside it and keeps memory bandwidth in use, so that neither resource sits idle waiting for the other.
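The benefit can be illustrated with a toy cost model. The sketch below is a simplification, not a description of any real serving system: it assumes each mini-batch is described only by how long it occupies the compute units and how long it occupies memory bandwidth, and that the two resources can be used fully in parallel. The names `MiniBatch`, `sequential_time`, and `overlapped_time`, as well as the millisecond figures, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MiniBatch:
    compute_ms: float  # time the mini-batch keeps the compute units busy
    memory_ms: float   # time the mini-batch keeps memory bandwidth busy


def sequential_time(prefill: MiniBatch, decode: MiniBatch) -> float:
    """Run the two mini-batches one after the other.

    Each mini-batch is limited by its own slower resource, and the
    other resource sits partly idle during that time.
    """
    prefill_step = max(prefill.compute_ms, prefill.memory_ms)
    decode_step = max(decode.compute_ms, decode.memory_ms)
    return prefill_step + decode_step


def overlapped_time(prefill: MiniBatch, decode: MiniBatch) -> float:
    """Run the two mini-batches together under idealized overlap.

    Assumption: compute and memory work proceed fully in parallel, so
    the step is bounded by whichever resource carries the larger total load.
    """
    total_compute = prefill.compute_ms + decode.compute_ms
    total_memory = prefill.memory_ms + decode.memory_ms
    return max(total_compute, total_memory)


# Illustrative (made-up) numbers: prefilling is dominated by compute,
# decoding is dominated by memory transfer.
prefill = MiniBatch(compute_ms=10.0, memory_ms=2.0)
decode = MiniBatch(compute_ms=1.0, memory_ms=8.0)

print(sequential_time(prefill, decode))  # 18.0
print(overlapped_time(prefill, decode))  # 11.0
```

With these assumed figures, running the mini-batches back to back takes 18 ms, while the idealized overlap takes 11 ms, because the decoding memory traffic is hidden behind the prefilling computation. Real systems fall somewhere between the two, since the phases still contend for some shared resources.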


Updated 2026-05-06


Tags

Foundations of Large Language Models

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences