Latency Variability as a Drawback of Continuous Batching
While prioritizing prefill work is effective for maximizing hardware utilization, it introduces a critical trade-off: significant variability in token-generation latency. This inconsistency is most pronounced in systems serving a mixed workload of long and short input sequences, because a short request's response can be stalled behind the lengthy prefill of a long request that arrived around the same time.
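The effect can be illustrated with a toy latency model. The function and all cost constants below are hypothetical and for illustration only: it assumes a scheduler that runs all queued prefill work before a new request produces its first token, so a short query's time-to-first-token grows with whatever prefill backlog is already in flight.

```python
def time_to_first_token(prompt_len, queued_prefill_tokens, ms_per_prefill_token=0.5):
    """Toy model (illustrative constants, not measurements): with
    prefill-prioritized scheduling, a new request's first token waits
    for all queued prefill tokens plus its own prompt's prefill."""
    return (queued_prefill_tokens + prompt_len) * ms_per_prefill_token

# A short 20-token conversational query on an otherwise idle server:
idle = time_to_first_token(20, queued_prefill_tokens=0)

# The same query arriving just behind an 8,000-token document summarization:
busy = time_to_first_token(20, queued_prefill_tokens=8000)

print(f"idle server: {idle} ms, behind long prefill: {busy} ms")
```

Even in this simplified model, the short query's first-token latency is dominated by someone else's prefill, which is exactly the unpredictability users observe under mixed workloads.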
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Chunked Prefilling
Example of Decoder Idle Time in Standard Prefilling
An inference server for a large language model uses a continuous batching scheduler designed to maximize hardware utilization by immediately adding new requests to the processing queue. System administrators notice that while the overall token generation rate is high, users submitting short, conversational queries experience significant and unpredictable delays. These delays are most pronounced when the server is simultaneously handling requests to summarize long documents. What is the most likely cause of the high latency for the short queries?
LLM Inference Performance Analysis
Analyzing Performance Trade-offs in LLM Serving