Learn Before
An LLM inference server uses an iteration-level scheduler to process two requests concurrently. Request A requires an initial computation (prefill) that is broken into two chunks. Request B is in the process of generating its first two tokens (decoding). To ensure both requests make progress without one blocking the other, the scheduler interleaves these tasks. Arrange the four computational tasks (Request A's two prefill chunks and Request B's two decode steps) into the most logical and efficient sequence of operations over four iterations.
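A minimal sketch of how an iteration-level scheduler could interleave the four tasks, one per iteration, so neither request blocks the other. The `Request` class and `round_robin_schedule` function are illustrative names invented for this sketch, not part of any real inference server.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    tasks: deque  # remaining units of work for this request

def round_robin_schedule(requests):
    """Run one task per iteration, alternating between active requests
    so that neither request blocks the other."""
    queue = deque(r for r in requests if r.tasks)
    schedule = []
    while queue:
        req = queue.popleft()
        schedule.append((req.name, req.tasks.popleft()))
        if req.tasks:              # request still has work: requeue it
            queue.append(req)
    return schedule

# Request A: prefill split into two chunks; Request B: two decode steps.
req_a = Request("A", deque(["prefill chunk 1", "prefill chunk 2"]))
req_b = Request("B", deque(["decode token 1", "decode token 2"]))

for i, (name, task) in enumerate(round_robin_schedule([req_a, req_b]), start=1):
    print(f"iteration {i}: request {name} -> {task}")
# iteration 1: request A -> prefill chunk 1
# iteration 2: request B -> decode token 1
# iteration 3: request A -> prefill chunk 2
# iteration 4: request B -> decode token 2
```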
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Example of Chunked Prefilling in Iteration-Level Scheduling
An inference server for a large language model is handling two user requests at the same time. Request A requires a long, multi-step initial processing phase before it can generate its first word. Request B is already in its generation phase, producing one word at a time. The server employs a scheduling system that, in each computational cycle, assigns exactly one unit of work—either a single step of initial processing or the generation of a single word—to each active request. What is the most significant outcome of using this scheduling approach in this scenario?
Evaluating an LLM Inference Scheduling Strategy
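The related questions above both hinge on the same outcome: with iteration-level scheduling and chunked prefill, Request B keeps generating while Request A's long prefill is still in progress. The rough simulation below is a hypothetical illustration (the step counts and the one-unit-of-work-per-request-per-cycle cost model are assumptions, not taken from the cards); it contrasts that interleaved policy with a request-level policy that runs A's entire prefill before resuming B.

```python
A_PREFILL_STEPS = 4   # Request A: long, multi-step initial processing
B_WORDS_TO_EMIT = 4   # Request B: already generating, one word per unit of work

def simulate(interleaved: bool):
    """Return the cycle at which Request B emits each word."""
    emitted_at = []
    a_remaining, b_remaining, cycle = A_PREFILL_STEPS, B_WORDS_TO_EMIT, 0
    while a_remaining or b_remaining:
        cycle += 1
        if interleaved:
            # Iteration-level scheduling: each active request gets one
            # unit of work per cycle, so B is never starved by A's prefill.
            if a_remaining:
                a_remaining -= 1
            if b_remaining:
                b_remaining -= 1
                emitted_at.append(cycle)
        else:
            # Request-level scheduling: A's entire prefill runs first,
            # and B only resumes generating once A is done.
            if a_remaining:
                a_remaining -= 1
            elif b_remaining:
                b_remaining -= 1
                emitted_at.append(cycle)
    return emitted_at

print("interleaved:", simulate(True))    # [1, 2, 3, 4]  B keeps making progress
print("sequential: ", simulate(False))   # [5, 6, 7, 8]  B stalls behind A's prefill
```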