Example of Decoder Idle Time in Standard Prefilling
This diagram illustrates a key inefficiency of standard 'prefill in one go' batching. Sequence 2, with a short prompt, completes its prefill (P₂₁) and first decoding step (D₂₁) in Iteration 1. It then sits idle throughout Iteration 2, because it must wait for the much longer prefill of Sequence 1 (P₁₁) to finish. Only once that prefill completes can both sequences decode in parallel, from Iteration 3 onwards. The idle period shows how a single long prefill task can block decoding for other sequences in the batch, leaving hardware underutilized and inflating latency for the sequences that are already ready to decode.
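The minimal Python sketch below (not from the course material) models the timeline described above. Every number and name in it is an illustrative assumption: prefill cost is taken as proportional to prompt length, each decode step costs one time unit, and the iteration order follows the diagram (the short sequence prefills and decodes first, then the long prefill runs while the short sequence waits).

```python
# Minimal sketch of the 'prefill in one go' schedule from the diagram.
# Prompt lengths, rates, and step counts below are illustrative assumptions.

def build_schedule(long_prompt: int, short_prompt: int,
                   decode_steps: int, prefill_rate: float = 1.0):
    """Return (timeline, idle_time) for the two-sequence scenario in the diagram.

    Assumed model: Iteration 1 prefills the short prompt and runs its first
    decode step; Iteration 2 runs the long prompt's prefill, during which the
    short sequence is idle; from Iteration 3 both sequences decode in parallel.
    """
    timeline = []
    # Iteration 1: the short sequence prefills (P21) and decodes once (D21).
    it1 = short_prompt / prefill_rate + 1
    timeline.append(("iter 1", {"seq1": "waiting", "seq2": "prefill + decode"}, it1))
    # Iteration 2: the long sequence prefills (P11); the short sequence sits idle.
    it2 = long_prompt / prefill_rate
    timeline.append(("iter 2", {"seq1": "prefill", "seq2": "IDLE"}, it2))
    # Iterations 3+: both sequences decode in parallel, one token per iteration.
    timeline.append(("iter 3+", {"seq1": "decode", "seq2": "decode"}, decode_steps))
    idle_seq2 = it2  # Sequence 2 is blocked for the entire long prefill.
    return timeline, idle_seq2


# Hypothetical sizes: a 4096-token document prompt vs. a 64-token chat prompt.
schedule, idle = build_schedule(long_prompt=4096, short_prompt=64, decode_steps=128)
for name, state, duration in schedule:
    print(f"{name:7s} {state} ({duration:.0f} time units)")
print(f"Sequence 2 idle time: {idle:.0f} time units")
```

In this simplified model, the short sequence's idle time equals the full duration of the long prefill; techniques such as chunked prefilling (listed under Related) shrink that gap by interleaving decode steps between prefill chunks.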
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Latency Variability as a Drawback of Continuous Batching
Chunked Prefilling
An inference server for a large language model uses a continuous batching scheduler designed to maximize hardware utilization by immediately adding new requests to the processing queue. System administrators notice that while the overall token generation rate is high, users submitting short, conversational queries experience significant and unpredictable delays. These delays are most pronounced when the server is simultaneously handling requests to summarize long documents. What is the most likely cause of the high latency for the short queries?
LLM Inference Performance Analysis
Analyzing Performance Trade-offs in LLM Serving
Learn After
A language model processes a batch containing two sequences: Sequence A with a long prompt and Sequence B with a short prompt. The system is configured to complete the prompt-processing (prefill) phase for every sequence in the batch before starting the parallel token-generation (decode) phase for any of them. Which statement best analyzes the primary source of computational inefficiency in this scenario?
Analyzing Hardware Utilization in Batched Inference
Explaining Inefficiency in Batched Processing