Analyzing Performance Trade-offs in LLM Serving
An LLM inference system uses a scheduling strategy that prioritizes starting the computation (the prefill) for newly arriving requests in order to keep the hardware as busy as possible. If a very long new request (e.g., summarizing a large document) is added to a batch that also contains several shorter requests already generating output, explain the mechanism by which the shorter requests experience an increase in their token-by-token generation time.
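To make the mechanism concrete, here is a minimal simulation sketch. The cost model, constants, and scheduler rule (one full prefill or one batched decode step per iteration) are illustrative assumptions, not the behavior of any particular serving framework:

```python
# Minimal sketch of continuous batching with prefill-prioritized scheduling.
# All names and the cost model below are illustrative assumptions.
from dataclasses import dataclass, field

PREFILL_MS_PER_TOKEN = 0.5   # hypothetical per-token prefill cost
DECODE_MS_PER_STEP = 20.0    # hypothetical cost of one batched decode step

@dataclass
class Request:
    name: str
    prompt_tokens: int
    tokens_to_generate: int
    token_times_ms: list = field(default_factory=list)  # completion time of each token

def simulate(running: list, arrivals: dict):
    """arrivals maps iteration index -> Request joining the batch at that point."""
    clock_ms = 0.0
    it = 0
    while running or arrivals:
        new = arrivals.pop(it, None)
        if new is not None:
            # Prefill-prioritized: the newcomer's whole prompt is processed
            # before the next decode step, stalling every running request.
            clock_ms += new.prompt_tokens * PREFILL_MS_PER_TOKEN
            running.append(new)
        # One decode step produces one token for every running request.
        clock_ms += DECODE_MS_PER_STEP
        for r in running:
            r.token_times_ms.append(clock_ms)
        running = [r for r in running if len(r.token_times_ms) < r.tokens_to_generate]
        it += 1

short = Request("short-chat", prompt_tokens=50, tokens_to_generate=10)
long_doc = Request("long-summary", prompt_tokens=16000, tokens_to_generate=10)
simulate([short], arrivals={3: long_doc})

# Inter-token gaps seen by the short request.
gaps = [b - a for a, b in zip([0.0] + short.token_times_ms, short.token_times_ms)]
print([round(g, 1) for g in gaps])
```

Running it prints steady ~20 ms inter-token gaps for the short request until iteration 3, then a single ~8020 ms gap (20 + 16000 * 0.5): because the scheduler runs the newcomer's full prefill before the next decode step, every in-flight request's next token waits behind that prefill, and the longer the new prompt, the longer the stall.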
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Latency Variability as a Drawback of Continuous Batching
Chunked Prefilling
Example of Decoder Idle Time in Standard Prefilling
An inference server for a large language model uses a continuous batching scheduler designed to maximize hardware utilization by immediately admitting new requests into the running batch. System administrators notice that while the overall token generation rate is high, users submitting short, conversational queries experience significant and unpredictable delays. These delays are most pronounced when the server is simultaneously handling requests to summarize long documents. What is the most likely cause of the high latency for the short queries? (A sketch of the usual mitigation, chunked prefilling, follows at the end of this note.)
LLM Inference Performance Analysis
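For contrast with the question above, here is a sketch of chunked prefilling (see the related note) under the same toy cost model; the chunk size and per-token costs are assumptions:

```python
# Chunked prefilling under the same toy cost model as the earlier sketch.
# The long prompt is split into fixed-size chunks, and one chunk is
# co-scheduled with each decode step, so running requests are never
# stalled for the entire prefill. Chunk size and costs are assumptions.
PREFILL_MS_PER_TOKEN = 0.5
DECODE_MS_PER_STEP = 20.0
CHUNK_TOKENS = 512  # hypothetical prefill chunk size

def decode_gaps_with_chunked_prefill(long_prompt_tokens: int, steps: int):
    """Inter-token latency seen by an already-running short request while
    a long prompt is prefilled in chunks alongside its decode steps."""
    remaining = long_prompt_tokens
    gaps = []
    for _ in range(steps):
        chunk = min(CHUNK_TOKENS, remaining)
        remaining -= chunk
        # Each iteration pays for one decode step plus at most one chunk.
        gaps.append(DECODE_MS_PER_STEP + chunk * PREFILL_MS_PER_TOKEN)
    return gaps

print([round(g, 1) for g in decode_gaps_with_chunked_prefill(16000, 8)])
```

Splitting the 16k-token prompt into 512-token chunks and co-scheduling one chunk with each decode step caps the worst-case inter-token gap at roughly 20 + 512 * 0.5 = 276 ms, instead of a single ~8000 ms pause. That bounded stall is the trade-off chunked prefilling makes: the long request's prefill finishes somewhat later in exchange for predictable decode latency for everyone else in the batch.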