Short Answer

Explaining Inefficiency in Batched Processing

Consider a batch of two sequences being processed by a language model. Sequence A has a very long prompt, and Sequence B has a very short prompt. The system uses a batching strategy in which every sequence's entire prompt must be processed before the second token can be generated for any sequence in the batch. Analyze why Sequence B will experience a significant delay before its second token is generated, even though its own prompt was processed quickly.
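
As a rough illustration of the scenario, the Python sketch below simulates the batching behaviour described above. The prompt lengths (1000 tokens for Sequence A, 10 for Sequence B) and the one-token-position-per-batch-step timing are assumptions made purely for illustration; they are not values given in the question.

    # Minimal sketch of the described batching behaviour, under assumed prompt
    # lengths and an assumed cost of one token position per synchronized batch step.

    PROMPT_LEN_A = 1000  # Sequence A: very long prompt (assumed length)
    PROMPT_LEN_B = 10    # Sequence B: very short prompt (assumed length)

    step = 0
    longest_prompt = max(PROMPT_LEN_A, PROMPT_LEN_B)

    # Prompt (prefill) phase: the batch keeps stepping until every prompt is consumed.
    while step < longest_prompt:
        step += 1
        if step == PROMPT_LEN_B:
            print(f"step {step}: Sequence B finishes its prompt and emits its first token")
        if step == PROMPT_LEN_A:
            print(f"step {step}: Sequence A finishes its prompt and emits its first token")

    # Decoding phase: only now can any sequence in the batch receive its second token.
    step += 1
    print(f"step {step}: the batch starts decoding; Sequence B finally gets its second token")
    print(f"Sequence B sat idle for {step - 1 - PROMPT_LEN_B} steps waiting on Sequence A")

Under these assumed numbers, Sequence B's own prompt is finished after 10 steps, but the synchronized batch cannot enter the decoding phase until step 1000, so B's second token only appears at step 1001.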

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy
