Concept

Balancing Throughput and Latency via Chunk Size in Chunked Prefilling

The effectiveness of chunked prefilling can be tuned by adjusting the chunk size. The goal is to choose a chunk size for which processing one prefill chunk takes about as long as a single decoding step. If chunks are too large, the decoding steps that share an iteration with a prefill chunk must wait for it, which raises per-token latency; if chunks are too small, the prompt takes many more iterations to process and the hardware is used less efficiently. Matching the two durations within the same iteration lets the system strike a better balance between maximizing overall throughput and minimizing token generation latency for individual requests.
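The selection rule can be illustrated with a minimal sketch. It assumes a simple linear cost model for a prefill chunk (a fixed overhead plus a per-token cost); the function names, the candidate sizes, and the timing numbers are all hypothetical, and in a real system the timings would come from profiling rather than from a formula.

```python
import math


def estimate_prefill_ms(chunk_size, fixed_ms, ms_per_token):
    """Assumed linear cost model: fixed overhead plus a per-token cost."""
    return fixed_ms + ms_per_token * chunk_size


def pick_chunk_size(decode_step_ms, fixed_ms, ms_per_token,
                    candidates=(64, 128, 256, 512, 1024, 2048)):
    """Return the candidate whose estimated prefill-chunk time is closest
    to the duration of a single decoding step."""
    return min(
        candidates,
        key=lambda c: abs(estimate_prefill_ms(c, fixed_ms, ms_per_token)
                          - decode_step_ms),
    )


def num_chunks(prompt_len, chunk_size):
    """Number of prefill chunks (iterations) needed to cover the prompt."""
    return math.ceil(prompt_len / chunk_size)


if __name__ == "__main__":
    # Hypothetical measurements: a 30 ms decoding step, 2 ms fixed
    # overhead per prefill chunk, and 0.05 ms per prompt token.
    chunk = pick_chunk_size(decode_step_ms=30.0, fixed_ms=2.0, ms_per_token=0.05)
    print("chunk size:", chunk)
    print("chunks for a 4000-token prompt:", num_chunks(4000, chunk))
```

With these illustrative numbers, a slower decoding step pushes the choice toward larger chunks (fewer iterations per prompt), while a faster decoding step pushes it toward smaller chunks (less delay added to each generated token).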

Updated 2026-05-06

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences