
Improved Throughput and Reduced Latency with Chunked Prefilling

By splitting long input sequences into smaller chunks and processing one chunk per iteration, chunked prefilling makes the computation time of prefill and decode work within the same iteration far more uniform. This balancing prevents decode requests from being stalled behind a single long prefill, which reduces decoder idle time and consequently improves both overall system throughput and per-token latency.
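The scheduling idea above can be sketched in a few lines. This is a hypothetical simulation, not code from any real inference engine: the chunk size, token counts, and the `schedule_iterations` helper are illustrative assumptions chosen to show how one long prefill is broken into chunks that each share an iteration with ongoing decode steps.

```python
def schedule_iterations(prompt_len: int, chunk_size: int, num_decode_seqs: int):
    """Sketch of chunked prefilling: split a long prefill of
    `prompt_len` tokens into chunks of at most `chunk_size`, so each
    iteration processes one prefill chunk plus one decode token for
    every in-flight decoding sequence, instead of one monolithic
    prefill that stalls all decoders."""
    iterations = []
    processed = 0
    while processed < prompt_len:
        chunk = min(chunk_size, prompt_len - processed)
        iterations.append({
            "prefill_tokens": chunk,          # work for the new request
            "decode_tokens": num_decode_seqs, # decoders still make progress
        })
        processed += chunk
    return iterations

# A 4096-token prompt arrives while 8 sequences are decoding.
iters = schedule_iterations(prompt_len=4096, chunk_size=512, num_decode_seqs=8)
print(len(iters))   # 8 iterations; decoders advance in every one
print(iters[0])
```

With a chunk size of 512, the decoders emit a token every iteration rather than waiting for the full 4096-token prefill to finish, which is the idle-time reduction the paragraph describes; real systems pick the chunk size to balance prefill and decode compute per iteration.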


Updated 2026-05-06


Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences