Explaining Inefficiency in Batched Processing
Consider a batch of two sequences being processed by a language model: Sequence A has a very long prompt, and Sequence B has a very short prompt. The system uses a batching strategy in which the prompt-processing (prefill) phase must complete for every sequence in the batch before token generation (decode) can begin for any of them. Analyze why Sequence B experiences a significant delay before its second token is generated, even though its own prompt was processed quickly.
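To make the queueing effect concrete, here is a minimal timing sketch in Python. It is not based on any real serving framework; the per-token costs, the prompt lengths, and the helper name `second_token_latency` are all illustrative assumptions. It models batch-synchronous prefill: decode cannot start until the longest prompt in the batch has been fully processed, so the short-prompt sequence sits idle in the meantime.

```python
# Minimal sketch of batch-synchronous prefill timing (hypothetical numbers,
# not tied to any real serving framework).

PREFILL_TIME_PER_TOKEN = 1.0  # assumed cost to prefill one prompt token
DECODE_TIME_PER_STEP = 1.0    # assumed cost of one batched decode step

def second_token_latency(prompt_lengths):
    """For each sequence, report when its second token is generated,
    given that decode starts only after the longest prompt in the
    batch has finished prefill."""
    batch_prefill_time = max(prompt_lengths) * PREFILL_TIME_PER_TOKEN
    report = {}
    for i, n in enumerate(prompt_lengths):
        own_prefill = n * PREFILL_TIME_PER_TOKEN
        # The sequence's first token is ready once its own prompt is
        # processed, but it must then idle until the whole batch's
        # prefill completes; only then does one decode step yield token 2.
        report[f"seq_{i}"] = {
            "own_prefill_done": own_prefill,
            "idle_waiting_for_batch": batch_prefill_time - own_prefill,
            "second_token_at": batch_prefill_time + DECODE_TIME_PER_STEP,
        }
    return report

# Sequence A: 2048-token prompt; Sequence B: 16-token prompt (illustrative).
for seq, times in second_token_latency([2048, 16]).items():
    print(seq, times)
```

With these assumed numbers, Sequence B finishes its own prefill at t = 16 but idles for 2032 time units waiting on Sequence A, so its second token appears only at t = 2049.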
Tags
Ch.5 Inference - Foundations of Large Language Models
Computing Sciences
Analysis in Bloom's Taxonomy
Related
A language model processes a batch containing two sequences: Sequence A with a long prompt and Sequence B with a short prompt. The system is configured to complete the entire prompt-processing (prefill) phase for all sequences in the batch before starting the parallel token-generation (decode) phase for the entire batch. Which statement best analyzes the primary source of computational inefficiency in this scenario?
Analyzing Hardware Utilization in Batched Inference