Iteration-Based Scheduling in LLM Inference
Iteration-based scheduling is a fine-grained strategy in which the scheduler interacts with the inference engine at every token-generation step, rather than waiting for an entire sequence to finish, as request-level scheduling does. This permits dynamic adjustments to the active batch during execution: finished sequences can be evicted immediately, and if a critical or urgent request arrives, the scheduler can insert it into the ongoing batch right away instead of letting it wait for the current batch to drain.
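As a minimal sketch of the idea, the loop below runs one decode iteration at a time and, between iterations, admits waiting requests into the batch, preempting the least urgent active request when a strictly more urgent one arrives. The class and method names (`IterationScheduler`, `submit`, `step`) and the token-counting stub for the model are illustrative assumptions, not any particular engine's API:

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass(order=True)
class Request:
    priority: int                              # lower value = more urgent
    seq: int                                   # arrival order, breaks priority ties
    prompt: str = field(compare=False)
    tokens_left: int = field(compare=False)    # tokens still to generate (stub for real decoding)

class IterationScheduler:
    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.queue = []      # min-heap of waiting requests, ordered by (priority, arrival)
        self.active = []     # requests currently in the running batch
        self._ids = count()

    def submit(self, prompt: str, tokens_to_generate: int, priority: int = 10) -> None:
        heapq.heappush(self.queue,
                       Request(priority, next(self._ids), prompt, tokens_to_generate))

    def step(self) -> list:
        """Run one decode iteration; returns prompts of sequences that finished."""
        # Between iterations: fill free batch slots from the waiting queue.
        while self.queue and len(self.active) < self.max_batch_size:
            self.active.append(heapq.heappop(self.queue))
        # If the batch is full but a strictly more urgent request is waiting,
        # preempt the least urgent active request back onto the queue.
        if self.queue and self.active:
            victim = max(self.active)          # least urgent active request
            if self.queue[0] < victim:
                self.active.remove(victim)
                heapq.heappush(self.queue, victim)
                self.active.append(heapq.heappop(self.queue))
        # One iteration: every active sequence emits one token (stubbed as a counter).
        for req in self.active:
            req.tokens_left -= 1
        finished = [r.prompt for r in self.active if r.tokens_left == 0]
        self.active = [r for r in self.active if r.tokens_left > 0]
        return finished
```

Because admission happens at every `step`, an urgent request submitted mid-run joins the very next iteration rather than waiting behind whole sequences, which is the behavior the paragraph above describes.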
Tags
Foundations of Large Language Models
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Scheduler-Driven Batch Adjustments Between Iterations in Continuous Batching
An LLM inference system is receiving a high volume of requests. In its queue are several short, low-priority requests and one long, high-priority request. To maximize overall system efficiency, what is the most probable action the scheduler component will take?
Diagnosing LLM Inference System Performance Issues
Analyzing Scheduler Trade-offs in LLM Inference
Request-Level Scheduling in LLM Inference