Learn Before
Balancing Throughput and Latency via Chunk Size in Chunked Prefilling
The effectiveness of chunked prefilling depends heavily on the chunk size. The goal is to select a chunk size whose processing time is comparable to that of a single decoding step: if chunks are too large, decoding steps scheduled alongside them stall and per-token generation latency rises; if chunks are too small, the prefill work is spread across many inefficient forward passes and overall throughput drops. By aligning the two durations within the same iteration, the system strikes a better balance between maximizing overall throughput and minimizing token generation latency for individual requests.
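To make the balancing idea concrete, the following is a minimal Python sketch, not drawn from any particular serving framework: it assumes hypothetical profiling helpers (measure_prefill_chunk_time and measure_decode_step_time) and simply picks the candidate chunk size whose measured prefill time is closest to the time of a single decoding step.

```python
# Hypothetical sketch: choose a prefill chunk size whose per-chunk latency
# roughly matches the latency of one decoding step, so prefill chunks and
# decode steps can share an iteration without stalling either side.
# The profiling helpers passed in are assumed to exist in the operator's
# own benchmarking code; they are not part of any real serving library.

def choose_chunk_size(candidate_sizes, measure_prefill_chunk_time,
                      measure_decode_step_time, batch_size):
    """Return the candidate chunk size whose prefill time is closest to
    the time of one decoding step at the given batch size."""
    decode_time = measure_decode_step_time(batch_size)
    best_size, best_gap = None, float("inf")
    for size in candidate_sizes:
        prefill_time = measure_prefill_chunk_time(size, batch_size)
        gap = abs(prefill_time - decode_time)
        if gap < best_gap:
            best_size, best_gap = size, gap
    return best_size


if __name__ == "__main__":
    # Toy timing models for illustration only: prefill cost grows with
    # chunk length, decode cost is roughly constant per step for a batch.
    fake_prefill = lambda size, batch: 0.00002 * size + 0.0005 * batch
    fake_decode = lambda batch: 0.003 * batch

    sizes = [128, 256, 512, 1024, 2048]
    print(choose_chunk_size(sizes, fake_prefill, fake_decode, batch_size=8))
```

In practice the timings would come from profiling the actual model and hardware rather than the toy formulas above; the sketch only illustrates the selection logic of matching chunk processing time to decode-step time.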
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Improved Throughput and Reduced Latency with Chunked Prefilling
Comparison of Processing in Chunked vs. Standard Prefilling
Balancing Throughput and Latency via Chunk Size in Chunked Prefilling
Increased Scheduling Complexity in Chunked Prefilling
Example of Chunked Prefilling in Iteration-Level Scheduling
An LLM inference server handles a mix of long document summarization requests and short, interactive chat queries. Operators observe that chat queries experience high latency whenever a long document's initial processing pass is running. To mitigate this, they implement a system that breaks the initial input of long documents into smaller segments, processing each segment in a separate forward pass to incrementally build the necessary cache. Which statement best evaluates the primary trade-off of this change?
Optimizing Inference Scheduling
An LLM inference system is using a method to process a long input sequence that has been divided into several segments or 'chunks'. Arrange the following steps in the correct chronological order to describe how the system incrementally builds the Key-Value (KV) cache for the entire input before starting to generate a response.
Learn After
An engineering team is optimizing a large-scale text generation service that processes long user prompts by breaking them into sequential segments. The team observes that while the service can handle a high volume of concurrent requests (high throughput), individual users complain about a noticeable delay before the first word of a response appears (high latency). The processing time for each segment is currently much longer than the time required to generate a single output word. Which of the following actions is the most effective first step to address the high latency issue?
Inference Service Performance Tuning
Performance Tuning for Sequential Input Processing