Google

Continuous batching is an iteration-based scheduling method, notably used in the Orca system, where the composition of a request batch is dynamically adjusted between computational steps. This flexibility allows for new input sequences to be added or completed sequences to be removed from the batch during any iteration. This adjustment can occur even if the processing for the entire batch is not yet finished, distinguishing it from static methods.

Continuous Batching for LLM Inference

In continuous batching, an iteration represents a distinct step in computation, corresponding to either the full prefilling phase for a given input or a single token's decoding step. For instance, given an input sequence $$\mathbf{x}=x_0...x_m$$ and a target output sequence $$\mathbf{y}=y_1...y_n$$, processing requires a total of $$n+1$$ iterations. This includes one initial iteration to handle prefilling, followed by $$n$$ iterations to generate the output sequence, yielding one token per iteration.

Iteration in Continuous Batching

The continuous batching method follows a general, multi-step procedure. This process outlines the sequence of actions taken to dynamically manage request batches, from their initial creation and iterative adjustment to their eventual termination.

General Process of Continuous Batching

Continuous batching demonstrates its efficiency when a new request arrives while an existing batch is already undergoing decoding. For example, after an initial batch of requests (e.g., x1, x2, x3) has completed its prefilling and several decoding steps, a new request (x4) might arrive. The system can then, in the next computational iteration, perform the prefilling for the new request x4 while simultaneously executing another decoding step for the ongoing requests x1, x2, and x3. This concurrent execution of prefilling for new requests and decoding for existing ones is a key feature that maximizes hardware utilization and system throughput.

Example of Interleaving Prefilling and Decoding in Continuous Batching

A significant trade-off in continuous batching is the overhead associated with its dynamic nature. The scheduler must constantly reorganize batches by rearranging data in memory whenever requests are added or removed. This continuous reassessment and optimization of the batch structure incurs both computational and memory costs. These overheads can lead to negative consequences such as increased memory fragmentation and, in some situations, additional processing latency, which can counteract the throughput gains.

Overhead of Dynamic Batch Reorganization in Continuous Batching

During the process of generating text, language models continuously allocate and deallocate memory, particularly for the KV cache. This dynamic memory usage can lead to fragmentation, where the available memory is split into numerous small, non-contiguous blocks. The diagram visualizes this with interspersed used and free memory blocks. This fragmentation poses a significant challenge, as it can prevent the allocation of large, contiguous memory chunks needed for new or growing sequences, thereby reducing system efficiency.

Memory Fragmentation in LLM Inference

The prefilling-prioritized strategy is a core characteristic of continuous batching where the scheduler adds new requests to the active batch as soon as the inference engine has available resources. By processing these new requests for prefilling as early as possible, this approach is designed to maximize system throughput. However, this prioritization comes at the cost of increased latency for ongoing requests, as the prefilling of new, long inputs can extend the overall processing time for the entire batch.

Prefilling-Prioritized Strategy in Continuous Batching

Simple iteration-level scheduling is a strategy used in systems like continuous batching where decisions are made at each discrete computational step, or iteration. In any given iteration, the scheduler assigns a single task—such as one decoding step or one chunk of a prefill operation—to each sequence in the active batch. This method enables the fine-grained interleaving of different computational tasks, such as processing a new request's prefill concurrently with the decoding steps of ongoing requests.

Simple Iteration-level Scheduling

Priority-based scheduling is a general strategy for managing LLM inference by allocating system resources according to the designated importance of certain requests or computational steps. This approach aligns resource usage with specific performance goals. For instance, decoding steps can be prioritized to minimize token generation latency for individual requests, whereas prefilling steps can be prioritized to maximize overall system throughput in batch-processing scenarios.

Priority-Based Scheduling in LLM Inference

In practical applications, scheduling systems can be designed with custom priority policies that go beyond simple prefill/decode prioritization. These policies allow practitioners to account for specific operational needs and constraints, such as meeting request deadlines or giving precedence to requests based on user-defined importance levels.

Custom Priority Policies in LLM Scheduling

This strategy, known as the disaggregation of prefilling and decoding, implements continuous batching by using two specialized hardware engines. A dedicated 'Engine 1' performs prefilling for a batch of requests. Once complete, the generated Key-Value (KV) cache is sent to a separate 'Engine 2' for decoding. The primary benefit of this pipeline is that Engine 1 can immediately start prefilling the next batch while Engine 2 is decoding the first. This overlapping of computations is key to improving computational efficiency and maximizing hardware utilization.

Disaggregation of Prefilling and Decoding using Pipelined Engines

Continuous and standard batching strategies differ fundamentally in their prioritization, which leads to distinct performance trade-offs. Continuous batching employs a prefilling-prioritized approach, where new requests are added to the batch as soon as computational resources become available. This method maximizes system throughput and hardware utilization but can increase the processing latency for requests already in the batch. Conversely, standard batching is decoding-prioritized, meaning it processes an entire batch to completion before handling new requests. This ensures lower latency for the active batch but results in reduced device utilization and overall system throughput.

Comparison of Continuous (Prefilling-Prioritized) vs. Standard (Decoding-Prioritized) Batching

Based on the scenario, evaluate the primary performance trade-off of implementing this new dynamic scheduling method. Explain which type of request is prioritized and what the potential negative consequence is for the other type of request.

LLM Inference Scheduling Strategy

An LLM inference server is processing a batch of three long-running requests. In the middle of this process, after several computational steps have already been completed for the initial batch, a new, short request arrives. How would a system implementing continuous batching most likely handle this new request in the next computational step?

An LLM inference system is designed to maximize hardware utilization. Which of the following operational descriptions best illustrates the core principle of continuous batching, distinguishing it from a static batching approach?

Learn Before

Related