Throughput-Latency Trade-off in Prefilling-Prioritized Continuous Batching
A prefilling-prioritized strategy in continuous batching maximizes throughput and hardware utilization by scheduling the compute-dense prefill of newly arrived requests ahead of the decode steps of ongoing ones, but this comes at a significant latency cost. When a request with a long input sequence arrives, its prefill can monopolize the computational resources of an entire iteration. The decode steps of shorter, in-flight sequences are then stalled until the prefill completes, which inflates their inter-token latency and makes performance highly variable, especially under workloads that mix long and short requests.
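A minimal simulation sketch makes the mechanism concrete. The `Request` fields, millisecond costs, and the 4096-token prompt below are illustrative assumptions, not measurements or any real server's API; the point is only that one prefill-prioritized iteration delays every in-flight decode by the full prefill time.

```python
from dataclasses import dataclass, field

# Illustrative costs in milliseconds; real values depend on hardware and model.
PREFILL_MS_PER_TOKEN = 0.5   # prefill is compute-bound over the whole prompt
DECODE_MS_PER_STEP = 20.0    # one batched decode step emits one token/request

@dataclass
class Request:
    name: str
    prompt_len: int                 # tokens left to prefill (0 once prefilled)
    last_emit: float = 0.0          # clock time of the last emitted token
    inter_token_ms: list = field(default_factory=list)

def step(batch: list, clock: float) -> float:
    """Run one scheduler iteration and return the advanced clock."""
    prefills = [r for r in batch if r.prompt_len > 0]
    if prefills:
        # Prefill-prioritized: the new prompt runs first, stalling all decodes.
        req = prefills[0]
        clock += req.prompt_len * PREFILL_MS_PER_TOKEN
        req.prompt_len = 0
        req.last_emit = clock
    else:
        # Batched decode: every active request emits one token this iteration.
        clock += DECODE_MS_PER_STEP
        for r in batch:
            r.inter_token_ms.append(clock - r.last_emit)
            r.last_emit = clock
    return clock

# Two short chats are decoding steadily; a 4096-token summarization job lands.
batch = [Request("chat-a", prompt_len=0), Request("chat-b", prompt_len=0)]
clock = 0.0
for _ in range(3):
    clock = step(batch, clock)      # smooth ~20 ms inter-token latency
batch.append(Request("summarize", prompt_len=4096))
clock = step(batch, clock)          # long prefill monopolizes the iteration
clock = step(batch, clock)          # chat-a/b's next token is ~2068 ms late
print(batch[0].inter_token_ms)      # [20.0, 20.0, 20.0, 2068.0]
```

The final entry is the trade-off in miniature: the hardware never idled, so throughput stayed high, yet the chat requests' next token arrived two orders of magnitude later than usual. Chunked prefilling (see Learn After) mitigates this by splitting the long prompt across several iterations.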
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Throughput-Latency Trade-off in Prefilling-Prioritized Continuous Batching
An inference server is managing a batch of several short, ongoing requests that are mid-generation. A new request with a very long input sequence arrives, and the scheduler immediately incorporates it into the active batch to keep the hardware as busy as possible. What is the most probable consequence for the short requests already in the batch?
LLM Inference Server Performance Analysis
Evaluating Scheduling Strategies for Real-Time Applications
Learn After
Latency Variability as a Drawback of Continuous Batching
Chunked Prefilling
Example of Decoder Idle Time in Standard Prefilling
An inference server for a large language model uses a continuous batching scheduler designed to maximize hardware utilization by immediately adding new requests to the processing queue. System administrators notice that while the overall token generation rate is high, users submitting short, conversational queries experience significant and unpredictable delays. These delays are most pronounced when the server is simultaneously handling requests to summarize long documents. What is the most likely cause of the high latency for the short queries?
LLM Inference Performance Analysis
Analyzing Performance Trade-offs in LLM Serving