Multiple Choice

An inference server for a large language model uses a continuous batching scheduler that maximizes hardware utilization by inserting newly arrived requests into the running batch as soon as they arrive. System administrators notice that while the overall token generation rate is high, users submitting short, conversational queries experience significant and unpredictable delays. These delays are most pronounced when the server is simultaneously handling requests to summarize long documents. What is the most likely cause of the high latency for the short queries?

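To make the scenario concrete, the sketch below simulates an iteration-level (continuous batching) scheduler under a deliberately simplified cost model in which each scheduler iteration takes time proportional to the tokens it processes. The request sizes, the TOKENS_PER_MS constant, and the simulate helper are hypothetical assumptions for illustration only, not part of the original question or of any real serving engine.

```python
from dataclasses import dataclass, field

TOKENS_PER_MS = 100.0  # assumed hardware throughput (illustrative only)

@dataclass
class Request:
    name: str
    prompt_tokens: int
    output_tokens: int
    generated: int = 0
    prefilled: bool = False
    token_times: list = field(default_factory=list)

def simulate(requests):
    """Iteration-level batching: every request joins the running batch
    immediately, and a request's full (un-chunked) prefill is executed
    inside the iteration in which it is admitted."""
    clock_ms = 0.0
    active = list(requests)
    while any(r.generated < r.output_tokens for r in active):
        tokens_this_iter = 0
        for r in active:
            if not r.prefilled:
                tokens_this_iter += r.prompt_tokens  # whole prompt processed at once
                r.prefilled = True
            elif r.generated < r.output_tokens:
                tokens_this_iter += 1                # one decode token per iteration
        clock_ms += tokens_this_iter / TOKENS_PER_MS
        for r in active:
            if r.prefilled and r.generated < r.output_tokens:
                r.generated += 1
                r.token_times.append(clock_ms)       # record when each token appears
    return {r.name: r.token_times for r in active}

# Short chat query alone vs. alongside a long document-summarization request.
alone = simulate([Request("chat", prompt_tokens=20, output_tokens=5)])
mixed = simulate([Request("chat", prompt_tokens=20, output_tokens=5),
                  Request("summary", prompt_tokens=8000, output_tokens=5)])
print("time to first token, chat alone :", alone["chat"][0], "ms")
print("time to first token, chat mixed :", mixed["chat"][0], "ms")
```

Under these assumptions, the short query's first token in the mixed run cannot appear until the 8000-token prefill of the summarization request has finished, whereas in the solo run it arrives almost immediately.
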
Updated 2025-09-28

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science