Learn Before
Simple Iteration-level Scheduling
Simple iteration-level scheduling is a strategy used in continuous-batching systems, in which scheduling decisions are made at each discrete computational step, or iteration. In any given iteration, the scheduler assigns a single task, such as one decoding step or one chunk of a prefill operation, to each sequence in the active batch. This enables fine-grained interleaving of different kinds of work, for example processing a new request's prefill concurrently with the decoding steps of ongoing requests.
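To make the idea concrete, below is a minimal, illustrative Python sketch of such a scheduler loop. The `Sequence` class, `run_one_iteration` function, and all names are hypothetical and not taken from any particular serving framework; the sketch only assumes that each active sequence receives exactly one unit of work per iteration, a prefill chunk if prompt tokens remain, otherwise a single decode step.

```python
# Hypothetical sketch of simple iteration-level scheduling (illustrative names,
# not any specific serving framework's API). Each iteration, every active
# sequence is assigned exactly one task: one prefill chunk or one decode step.

from dataclasses import dataclass, field


@dataclass
class Sequence:
    seq_id: int
    prompt_chunks: list                      # pending prefill chunks (token-id lists)
    generated: list = field(default_factory=list)
    max_new_tokens: int = 4

    def finished(self) -> bool:
        # Done once prefill is complete and the token budget is exhausted.
        return not self.prompt_chunks and len(self.generated) >= self.max_new_tokens


def run_one_iteration(batch: list) -> list:
    """Assign one task to each active sequence and return the tasks executed."""
    tasks = []
    for seq in batch:
        if seq.prompt_chunks:
            chunk = seq.prompt_chunks.pop(0)              # one chunk of prefill
            tasks.append((seq.seq_id, f"prefill({len(chunk)} tokens)"))
        else:
            seq.generated.append("<tok>")                 # one decode step
            tasks.append((seq.seq_id, "decode(1 token)"))
    return tasks


# Example: request A is still prefilling (two chunks); request B is already decoding.
batch = [
    Sequence(seq_id=0, prompt_chunks=[[1, 2, 3], [4, 5]], max_new_tokens=2),  # request A
    Sequence(seq_id=1, prompt_chunks=[], max_new_tokens=2),                   # request B
]

step = 0
while any(not s.finished() for s in batch):
    step += 1
    active = [s for s in batch if not s.finished()]
    print(f"iteration {step}: {run_one_iteration(active)}")
```

Running the sketch shows request B's decode steps proceeding in the same iterations as request A's prefill chunks, rather than B stalling until A's prefill completes, which is the fine-grained interleaving described above.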
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Iteration in Continuous Batching
General Process of Continuous Batching
Example of Interleaving Prefilling and Decoding in Continuous Batching
Overhead of Dynamic Batch Reorganization in Continuous Batching
Memory Fragmentation in LLM Inference
Prefilling-Prioritized Strategy in Continuous Batching
Simple Iteration-level Scheduling
Priority-Based Scheduling in LLM Inference
Custom Priority Policies in LLM Scheduling
Disaggregation of Prefilling and Decoding using Pipelined Engines
Comparison of Continuous (Prefilling-Prioritized) vs. Standard (Decoding-Prioritized) Batching
LLM Inference Scheduling Strategy
An LLM inference server is processing a batch of three long-running requests. In the middle of this process, after several computational steps have already been completed for the initial batch, a new, short request arrives. How would a system implementing continuous batching most likely handle this new request in the next computational step?
An LLM inference system is designed to maximize hardware utilization. Which of the following operational descriptions best illustrates the core principle of continuous batching, distinguishing it from a static batching approach?
Learn After
Example of Chunked Prefilling in Iteration-Level Scheduling
An inference server for a large language model is handling two user requests at the same time. Request A requires a long, multi-step initial processing phase before it can generate its first word. Request B is already in its generation phase, producing one word at a time. The server employs a scheduling system that, in each computational cycle, assigns exactly one unit of work—either a single step of initial processing or the generation of a single word—to each active request. What is the most significant outcome of using this scheduling approach in this scenario?
An LLM inference server uses an iteration-level scheduler to process two requests concurrently. Request A requires an initial computation (prefill) that is broken into two chunks. Request B is in the process of generating its first two tokens (decoding). To ensure both requests make progress without one blocking the other, the scheduler interleaves these tasks. Arrange the four computational tasks below into the most logical and efficient sequence of operations over four iterations.
Evaluating an LLM Inference Scheduling Strategy