Example

Example of Chunked Prefilling in Iteration-Level Scheduling

In an iteration-level scheduling system, chunked prefilling lets a batch containing multiple sequences interleave prefilling and decoding steps. For instance, consider a batch with two sequences. Standard scheduling treats the entire prefilling of the first sequence as a single iteration, so the second sequence's decoding steps (e.g., $D_{22}$) must wait until that whole prefill is complete. In contrast, chunked prefilling divides the first sequence's prefilling into smaller steps, such as chunks $P_{11}$, $P_{12}$, and $P_{13}$. Because each chunk occupies only one iteration, decoding steps of the second sequence can be scheduled in the same iterations as these prefilling chunks (e.g., $D_{22}$ can execute alongside $P_{12}$). The decoding sequence therefore no longer stalls behind a long prefill, and its output tokens are generated earlier.
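The timing difference can be illustrated with a small simulation. This is a toy model, not an actual serving-system scheduler. Several pieces are hypothetical assumptions made only for illustration:

- the step costs (`PREFILL_CHUNK_COST`, `DECODE_COST`);
- the helper names (`Step`, `run`, `standard_schedule`, `chunked_schedule`);
- the rule that an iteration lasts as long as the sum of its steps' costs.

Chunk $P_{1k}$ is paired with decode step $D_{2k}$, so that $D_{22}$ runs alongside $P_{12}$ as in the example above.

```python
from dataclasses import dataclass

# Hypothetical per-step costs in arbitrary time units (assumptions, not measurements):
# a prefill chunk processes many prompt tokens, a decode step processes one token.
PREFILL_CHUNK_COST = 4.0
DECODE_COST = 1.0


@dataclass
class Step:
    name: str    # e.g. "P12" (chunk 2 of sequence 1) or "D22" (decode step 2 of sequence 2)
    cost: float  # compute cost of this step in the toy time units above


def run(schedule: list[list[Step]]) -> dict[str, float]:
    """Run iterations back to back and return the finish time of every step.

    Toy assumption: an iteration lasts as long as the sum of its steps' costs,
    and every step in an iteration finishes when the iteration finishes.
    """
    t = 0.0
    finish: dict[str, float] = {}
    for iteration in schedule:
        t += sum(step.cost for step in iteration)
        for step in iteration:
            finish[step.name] = t
    return finish


def standard_schedule(num_chunks: int, num_decodes: int) -> list[list[Step]]:
    # The whole prefill of sequence 1 is one long iteration ("P1");
    # sequence 2's decoding steps only run after it.
    schedule = [[Step("P1", num_chunks * PREFILL_CHUNK_COST)]]
    for j in range(1, num_decodes + 1):
        schedule.append([Step(f"D2{j}", DECODE_COST)])
    return schedule


def chunked_schedule(num_chunks: int, num_decodes: int) -> list[list[Step]]:
    # Each prefill chunk P1k gets its own iteration, and decode step D2k of
    # sequence 2 is batched into that same iteration.
    schedule = []
    for k in range(1, max(num_chunks, num_decodes) + 1):
        iteration = []
        if k <= num_chunks:
            iteration.append(Step(f"P1{k}", PREFILL_CHUNK_COST))
        if k <= num_decodes:
            iteration.append(Step(f"D2{k}", DECODE_COST))
        schedule.append(iteration)
    return schedule


std = run(standard_schedule(num_chunks=3, num_decodes=3))
chk = run(chunked_schedule(num_chunks=3, num_decodes=3))
print("D22 finishes at", std["D22"], "(standard) vs", chk["D22"], "(chunked)")
# prints: D22 finishes at 14.0 (standard) vs 10.0 (chunked)
```

In this toy model, sequence 2's first decoding step finishes at 5 instead of 13. The cost is that sequence 1's prefill finishes at 15 instead of 12, because each of its chunks now shares an iteration with a decode step. The additive cost rule is a simplifying assumption. Real iteration times depend on the model, the hardware, and the batch composition.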

Image 0

Updated 2026-05-06

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences