Chunked Prefilling

Chunked prefilling is a technique that improves serving efficiency by overlapping the prefilling of one sequence with the decoding of another. It divides a long input sequence into smaller segments, or "chunks," and processes each in a separate forward pass, incrementally building the KV cache. This lets the scheduler balance long prefill work against short decode steps, reducing decoder idle time and improving overall throughput. The trade-offs are increased memory overhead from keeping intermediate KV caches resident, lower per-pass parallelism than a single-pass prefill, and greater scheduling complexity.
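The interleaving described above can be sketched with a toy round-robin scheduler. This is a minimal illustration, not a real serving engine: the `Sequence`, `chunk_tokens`, and `step` names are hypothetical, and the "model" is simulated by simply appending tokens to a list that stands in for the KV cache.

```python
from collections import deque

CHUNK_SIZE = 4  # assumed chunk size for this sketch

def chunk_tokens(tokens, chunk_size=CHUNK_SIZE):
    """Split a long prompt into fixed-size chunks for incremental prefill."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

class Sequence:
    def __init__(self, name, prompt):
        self.name = name
        self.kv_cache = []                          # grows chunk-by-chunk, then token-by-token
        self.prefill_chunks = deque(chunk_tokens(prompt))
        self.decoded = []

def step(log, seq):
    """One forward pass for `seq`: a prefill chunk if any remain, else one decode step."""
    if seq.prefill_chunks:
        chunk = seq.prefill_chunks.popleft()
        seq.kv_cache.extend(chunk)                  # incrementally build the KV cache
        log.append((seq.name, "prefill", len(chunk)))
    else:
        tok = f"gen{len(seq.decoded)}"              # stand-in for sampling a token
        seq.kv_cache.append(tok)
        seq.decoded.append(tok)
        log.append((seq.name, "decode", 1))

# Interleave a long prefill (A) with decoding of an almost-finished sequence (B):
# B starts decoding after one short chunk instead of waiting for A's full prefill.
a = Sequence("A", prompt=list(range(10)))           # needs 3 prefill chunks
b = Sequence("B", prompt=list(range(2)))            # 1 chunk, then decodes
log = []
for _ in range(4):                                  # 4 scheduler rounds, round-robin
    step(log, a)
    step(log, b)
```

After four rounds, B has already emitted several tokens while A's prefill was still in flight, which is the idle-time reduction the paragraph describes; a monolithic scheduler would have stalled B until A's entire prompt was processed.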

Updated 2026-05-06

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences