Concept

Throughput-Latency Trade-off in Prefilling-Prioritized Continuous Batching

While a prefilling-prioritized strategy in continuous batching is designed to maximize throughput and hardware utilization by scheduling prefill work ahead of pending decode steps, it introduces a significant latency trade-off. When a long input sequence arrives, its prefilling stage can dominate the computational budget of an iteration. Ongoing shorter sequences must then wait for their next decoding step, which inflates their inter-token latency and creates high variability in performance, especially under workloads that mix long and short requests.
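This effect can be illustrated with a minimal discrete-time simulation. The cost model below is a hypothetical simplification (prefill costs one time unit per prompt token, one batched decode step costs one unit; the `Request` class and `simulate` function are illustrative, not from any real serving system): whenever a new request has arrived, the scheduler runs its full prefill before any decode step, stalling every active sequence.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    name: str
    prompt_len: int                # prefill cost in time units (assumed: 1 per prompt token)
    arrival: int                   # time the request enters the queue
    tokens_needed: int             # decode tokens to generate
    token_times: list = field(default_factory=list)  # completion time of each token

def simulate(requests):
    """Prefill-prioritized continuous batching under a toy cost model:
    arrived requests are always prefilled first; only then does one
    batched decode step (1 time unit) run for all active sequences."""
    t = 0
    waiting = sorted(requests, key=lambda r: r.arrival)
    active = []
    while waiting or active:
        # Prefill priority: admit every request that has already arrived,
        # paying its full prefill cost before any decoding resumes.
        while waiting and waiting[0].arrival <= t:
            req = waiting.pop(0)
            t += req.prompt_len    # decodes stall for the whole prefill
            active.append(req)
        if active:
            t += 1                 # one decode step for the whole batch
            for r in active:
                r.token_times.append(t)
            active = [r for r in active if len(r.token_times) < r.tokens_needed]
        elif waiting:
            t = waiting[0].arrival # idle until the next arrival
    return requests

short = Request("short", prompt_len=4, arrival=0, tokens_needed=6)
long_ = Request("long", prompt_len=100, arrival=8, tokens_needed=2)
simulate([short, long_])
gaps = [b - a for a, b in zip(short.token_times, short.token_times[1:])]
print(gaps)  # → [1, 1, 1, 101, 1]
```

The short request decodes steadily (gap of 1 unit per token) until the long request's prefill preempts the batch, at which point its inter-token latency jumps to 101 units, exactly the variability the paragraph above describes.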


Updated 2026-05-06


Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences