Explaining Inefficiency in Batched Processing
Consider a batch of two sequences being processed by a language model: Sequence A has a very long prompt, and Sequence B has a very short prompt. The system uses a batching strategy in which the prompt-processing (prefill) phase must complete for every sequence in the batch before token generation (decode) can begin for any of them. Analyze why Sequence B experiences a significant delay before its second token is generated, even though its own prompt was processed quickly.
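To make the queueing effect concrete, here is a minimal timing sketch in Python. It is not based on any real serving framework; the per-token costs, the prompt lengths, and the helper name `second_token_latency` are all illustrative assumptions. It models batch-synchronous prefill: decode cannot start until the longest prompt in the batch has been fully processed, so the short-prompt sequence sits idle in the meantime.

```python
# Minimal sketch of batch-synchronous prefill timing (hypothetical numbers,
# not tied to any real serving framework).

PREFILL_TIME_PER_TOKEN = 1.0  # assumed cost to prefill one prompt token
DECODE_TIME_PER_STEP = 1.0    # assumed cost of one batched decode step

def second_token_latency(prompt_lengths):
    """For each sequence, report when its second token is generated,
    given that decode starts only after the longest prompt in the
    batch has finished prefill."""
    batch_prefill_time = max(prompt_lengths) * PREFILL_TIME_PER_TOKEN
    report = {}
    for i, n in enumerate(prompt_lengths):
        own_prefill = n * PREFILL_TIME_PER_TOKEN
        # The sequence's first token is ready once its own prompt is
        # processed, but it must then idle until the whole batch's
        # prefill completes; only then does one decode step yield token 2.
        report[f"seq_{i}"] = {
            "own_prefill_done": own_prefill,
            "idle_waiting_for_batch": batch_prefill_time - own_prefill,
            "second_token_at": batch_prefill_time + DECODE_TIME_PER_STEP,
        }
    return report

# Sequence A: 2048-token prompt; Sequence B: 16-token prompt (illustrative).
for seq, times in second_token_latency([2048, 16]).items():
    print(seq, times)
```

With these assumed numbers, Sequence B finishes its own prefill at t = 16 but idles for 2032 time units waiting on Sequence A, so its second token appears only at t = 2049.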
Tags
Ch.5 Inference - Foundations of Large Language Models
Computing Sciences
Analysis in Bloom's Taxonomy
Related
A language model processes a batch containing two sequences: Sequence A with a long prompt and Sequence B with a short prompt. The system is configured to complete the entire prompt-processing (prefill) phase for all sequences in the batch before starting the parallel token-generation (decode) phase for the entire batch. Which statement best analyzes the primary source of computational inefficiency in this scenario?
Analyzing Hardware Utilization in Batched Inference