Chunked Prefilling
Chunked prefilling is a technique that improves serving efficiency by overlapping the prefilling of one sequence with the decoding of others. It divides a long input sequence into smaller segments, or "chunks," and processes each chunk in a separate forward pass, incrementally building the KV cache. This lets the scheduler interleave long prefilling work with short decoding steps, reducing decoder idle time and improving overall throughput. The technique comes with trade-offs, however: added memory overhead from holding partially built KV caches across iterations, reduced per-pass parallelism compared to a single-pass prefill, and greater scheduling complexity.
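A minimal sketch of the mechanism, using a toy single-head attention layer in NumPy; the dimensions, chunk size, and random "prompt" are illustrative assumptions, not tied to any serving framework. Each loop iteration plays the role of one forward pass that extends the KV cache, and the final assertion checks that chunked prefilling produces exactly the same activations as a single-pass prefill:

```python
# Toy chunked prefill: one single-head attention layer, NumPy only.
# All sizes and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, chunk_size = 16, 4
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prefill_chunked(x):
    """Build the KV cache incrementally, one chunk per 'forward pass'."""
    k_cache, v_cache, outputs = [], [], []
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]            # one forward pass
        q, k, v = chunk @ Wq, chunk @ Wk, chunk @ Wv
        k_cache.append(k)                              # extend the KV cache
        v_cache.append(v)
        K, V = np.concatenate(k_cache), np.concatenate(v_cache)
        scores = q @ K.T / np.sqrt(d_model)
        for i in range(len(chunk)):                    # causal mask: token start+i
            scores[i, start + i + 1:] = -np.inf        # sees cache positions <= start+i
        outputs.append(softmax(scores) @ V)
    return np.concatenate(outputs)

def prefill_single_pass(x):
    """Reference: the whole prompt in one forward pass."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    scores[np.triu(np.ones((len(x), len(x)), dtype=bool), 1)] = -np.inf
    return softmax(scores) @ v

prompt = rng.normal(size=(10, d_model))                # a 10-token "prompt"
assert np.allclose(prefill_chunked(prompt), prefill_single_pass(prompt))
```

The equivalence holds because causal attention only looks backward: a chunk's queries attend to keys and values already in the cache, so splitting the prefill changes the schedule but not the math.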
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Latency Variability as a Drawback of Continuous Batching
Example of Decoder Idle Time in Standard Prefilling
An inference server for a large language model uses a continuous batching scheduler designed to maximize hardware utilization by admitting new requests into the running batch as soon as slots free up. System administrators notice that while the overall token generation rate is high, users submitting short, conversational queries experience significant and unpredictable delays. These delays are most pronounced when the server is simultaneously handling requests to summarize long documents. What is the most likely cause of the high latency for the short queries?
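A toy timeline makes the scenario concrete; all numbers below are made-up assumptions, with one time unit standing in for one scheduler iteration. A monolithic prefill pass cannot be preempted, so a chat query arriving mid-prefill waits for the whole document, whereas a chunked scheduler could admit it at the next chunk boundary:

```python
LONG_PREFILL = 32   # time units occupied by the document's monolithic prefill pass
CHUNK = 4           # time units per prefill chunk, if chunked prefilling were used
chat_arrival = 5    # a short chat query arrives while the prefill is running

# Monolithic prefill: the forward pass is not preemptible, so the chat
# query's first forward pass starts only when the whole prefill finishes.
wait_monolithic = LONG_PREFILL - chat_arrival

# Chunked prefill: the scheduler can slot the chat query in at the next
# chunk boundary.
next_boundary = (chat_arrival // CHUNK + 1) * CHUNK
wait_chunked = next_boundary - chat_arrival

print(f"wait under monolithic prefill: {wait_monolithic} units")  # 27
print(f"wait under chunked prefill:    {wait_chunked} units")     # 3
```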
LLM Inference Performance Analysis
Analyzing Performance Trade-offs in LLM Serving
Learn After
Improved Throughput and Reduced Latency with Chunked Prefilling
Comparison of Processing in Chunked vs. Standard Prefilling
Balancing Throughput and Latency via Chunk Size in Chunked Prefilling
Increased Scheduling Complexity in Chunked Prefilling
Example of Chunked Prefilling in Iteration-Level Scheduling
An LLM inference server handles a mix of long document summarization requests and short, interactive chat queries. Operators observe that chat queries experience high latency whenever a long document's initial processing pass is running. To mitigate this, they implement a system that breaks the initial input of long documents into smaller segments, processing each segment in a separate forward pass to incrementally build the necessary cache. Which statement best evaluates the primary trade-off of this change?
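The memory side of that trade-off is easy to put numbers on. A back-of-envelope sketch, assuming illustrative 7B-class model dimensions (32 layers, 32 heads, head dimension 128, fp16) that are not tied to any specific model: the partially built KV cache of a long document must stay resident across every scheduler iteration of its chunked prefill.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    # 2x for keys and values; the default parameters are assumed 7B-class values
    return 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes

doc_tokens = 16_000  # a long document being prefilled chunk by chunk
print(f"~{kv_cache_bytes(doc_tokens) / 2**30:.1f} GiB held for the whole prefill")  # ~7.8 GiB
```

A single-pass prefill eventually needs the same memory, but chunking keeps the partial cache reserved over many interleaved iterations while decode requests compete for the remaining space.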
Optimizing Inference Scheduling
An LLM inference system uses a method that processes a long input sequence by dividing it into several segments, or "chunks." Arrange the following steps in the correct chronological order to describe how the system incrementally builds the Key-Value (KV) cache for the entire input before starting to generate a response.
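Since the answer options are not reproduced here, the sketch below shows one natural chronological ordering as a runnable skeleton; split_into_chunks and forward_pass are hypothetical stand-ins, not a real inference API:

```python
def split_into_chunks(tokens, chunk_size):
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def forward_pass(chunk, kv_cache):
    # stand-in for a transformer forward pass: append this chunk's
    # keys/values to the cache while (conceptually) attending to the
    # entries already there
    return kv_cache + [("kv", token) for token in chunk]

tokens = list(range(10))                          # 1. receive the long input sequence
kv_cache = []                                     # 2. start with an empty KV cache
for chunk in split_into_chunks(tokens, 4):        # 3. take the next chunk
    kv_cache = forward_pass(chunk, kv_cache)      # 4. one forward pass extends the cache
assert len(kv_cache) == len(tokens)               # 5. cache now covers the entire input
# 6. decoding of the response begins only after this loop finishes
```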