An LLM inference system is handling two sequences simultaneously using iteration-level scheduling with chunked prefilling. Sequence A has a long prompt that is broken into three prefill chunks (P₁, P₂, P₃). Sequence B is already in the middle of generating its response, requiring individual decode steps (D₁, D₂, D₃). Arrange the following computational steps into the most efficient order that demonstrates this scheduling strategy, ensuring that neither sequence is unnecessarily blocked.
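A minimal sketch of the intended interleaving (the queue names and chunk/step labels are illustrative, not taken from any particular serving framework): each scheduler iteration admits the next prefill chunk of Sequence A alongside the next decode step of Sequence B, so B's generation is never stalled behind A's long prompt.

```python
from collections import deque

# Hypothetical sketch of iteration-level scheduling with chunked prefill.
# Sequence A's long prompt is split into three prefill chunks; Sequence B
# is mid-generation and needs one decode step per iteration.
prefill_chunks = deque(["P1", "P2", "P3"])  # Sequence A
decode_steps = deque(["D1", "D2", "D3"])    # Sequence B

schedule = []
while prefill_chunks or decode_steps:
    # One scheduler iteration: batch the next prefill chunk (if any)
    # together with the next decode step (if any), so neither sequence
    # is unnecessarily blocked.
    if prefill_chunks:
        schedule.append(prefill_chunks.popleft())
    if decode_steps:
        schedule.append(decode_steps.popleft())

print(schedule)  # ['P1', 'D1', 'P2', 'D2', 'P3', 'D3']
```

The resulting order, P₁ → D₁ → P₂ → D₂ → P₃ → D₃, is the schedule the question is after: Sequence B emits a token between every pair of prefill chunks.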
Related
A large language model inference system is processing two user requests concurrently. Request 1 has a very long initial prompt whose prefill requires significant computation. Request 2 is already mid-generation, producing one token at a time. The system's scheduler breaks Request 1's prefill into three smaller chunks: it processes the first chunk of Request 1, then generates one token for Request 2, then processes the second chunk, then generates another token, and so on. What is the primary advantage of this interleaved processing strategy?
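For intuition about that advantage, here is a back-of-the-envelope comparison under assumed (not measured) timings, showing how chunking bounds the delay Request 2 sees before its next token:

```python
# Illustrative timings (assumptions, not measurements): each prefill
# chunk takes 30 ms and each decode step takes 5 ms.
CHUNK_MS, DECODE_MS = 30.0, 5.0

# Blocking schedule: all three prefill chunks for Request 1 run before
# any decode step for Request 2.
blocking_wait = 3 * CHUNK_MS     # 90 ms until Request 2's next token

# Interleaved (chunked prefill): Request 2 decodes after every chunk,
# so its worst-case wait between tokens is one chunk, not the whole prompt.
interleaved_wait = CHUNK_MS      # 30 ms until Request 2's next token

print(f"next-token delay: {blocking_wait:.0f} ms (blocking) "
      f"vs {interleaved_wait:.0f} ms (interleaved)")
```

The key benefit is bounded inter-token latency for the decoding request: it waits at most one chunk's worth of compute rather than the entire prompt's prefill.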
Scheduling Strategy Evaluation for Hardware Upgrade