Learn Before
Iteration in Continuous Batching
In continuous batching, an iteration represents a distinct step in computation, corresponding to either the full prefilling phase for a given input or a single token's decoding step. For instance, given an input sequence $\mathbf{x}$ and a target output sequence $\mathbf{y}$, processing requires a total of $|\mathbf{y}| + 1$ iterations: one initial iteration to handle prefilling, followed by $|\mathbf{y}|$ iterations to generate the output sequence, yielding one token per iteration.
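This counting rule is easy to state in code. Below is a minimal sketch, assuming one iteration covers the entire prefill regardless of prompt length; the function name `iterations_for_request` is illustrative rather than taken from any inference framework.

```python
def iterations_for_request(num_output_tokens: int) -> int:
    """Total iterations needed to serve one request under continuous
    batching: one prefilling iteration covering the entire input prompt
    (however long it is), plus one decoding iteration per output token."""
    return 1 + num_output_tokens


# A request generating 3 output tokens takes 4 iterations in total:
# iteration 1 prefills the whole prompt, iterations 2-4 each decode one token.
assert iterations_for_request(num_output_tokens=3) == 4
```

Note that the prompt length does not appear in the count: prefilling processes all input tokens within a single iteration, so the total scales with the output length only.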
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Iteration in Continuous Batching
General Process of Continuous Batching
Example of Interleaving Prefilling and Decoding in Continuous Batching
Overhead of Dynamic Batch Reorganization in Continuous Batching
Memory Fragmentation in LLM Inference
Prefilling-Prioritized Strategy in Continuous Batching
Simple Iteration-level Scheduling
Priority-Based Scheduling in LLM Inference
Custom Priority Policies in LLM Scheduling
Disaggregation of Prefilling and Decoding using Pipelined Engines
Comparison of Continuous (Prefilling-Prioritized) vs. Standard (Decoding-Prioritized) Batching
LLM Inference Scheduling Strategy
An LLM inference server is processing a batch of three long-running requests. In the middle of this process, after several computational steps have already been completed for the initial batch, a new, short request arrives. How would a system implementing continuous batching most likely handle this new request in the next computational step?
An LLM inference system is designed to maximize hardware utilization. Which of the following operational descriptions best illustrates the core principle of continuous batching, distinguishing it from a static batching approach?
Learn After
A large language model using a continuous batching inference system processes a single request. The input prompt consists of 150 tokens, and the model is configured to generate an output of 200 tokens. How many computational iterations are required to fully process this single request?
LLM Inference Request Processing
In a continuous batching system for large language model inference, every single token processed, whether from the input prompt or the generated output, constitutes one separate computational iteration.