Example

Narrative Example of Dynamic Batch Management in Continuous Batching

This scenario illustrates how continuous batching dynamically manages sequences during inference, in contrast with standard request-level batching, which fixes a batch of input sequences and processes them to completion. As illustrated, the system continuously accepts and adds new requests to the current batch as long as compute capacity is available. Initially, two user requests, $\mathbf{x}_1$ and $\mathbf{x}_2$, are grouped into a batch and sent to the inference engine. After two iterations, a new request, $\mathbf{x}_3$, is received and incorporated into the active batch. The engine processes this updated batch concurrently, advancing the decoding process for $\mathbf{x}_1$ and $\mathbf{x}_2$ while executing the prefilling phase for $\mathbf{x}_3$. When $\mathbf{x}_2$ completes its generation, two additional requests, $\mathbf{x}_4$ and $\mathbf{x}_5$, arrive. The scheduler removes the finished $\mathbf{x}_2$ and, based on available capacity, adds $\mathbf{x}_4$ to the batch, while $\mathbf{x}_5$ is queued until resources free up.
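The scheduling behavior described above can be sketched in code. The following is a minimal, illustrative simulation, not the book's implementation: the names `Request` and `continuous_batching` are assumptions, and each engine iteration is simplified to either one prefill step or one decoded token per sequence. Finished sequences are evicted at the start of each iteration, and queued requests are admitted up to a fixed capacity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """A toy request: prefill once, then decode target_len tokens."""
    name: str
    target_len: int
    generated: int = 0
    prefilled: bool = False

    def step(self):
        # One engine iteration: run prefill first, then decode one token.
        if not self.prefilled:
            self.prefilled = True
        else:
            self.generated += 1

    @property
    def done(self):
        return self.generated >= self.target_len


def continuous_batching(arrivals, capacity, max_iters=100):
    """arrivals maps an iteration index to the Requests arriving then.

    Returns a log of (iteration, [names in the active batch]).
    """
    queue, batch, log = deque(), [], []
    last_arrival = max(arrivals, default=0)
    for it in range(max_iters):
        queue.extend(arrivals.get(it, []))
        # Evict finished sequences, then admit queued ones up to capacity.
        batch = [r for r in batch if not r.done]
        while queue and len(batch) < capacity:
            batch.append(queue.popleft())
        if not batch and not queue and it > last_arrival:
            break
        for r in batch:
            r.step()
        log.append((it, [r.name for r in batch]))
    return log


# Reproduce the narrative: x1 and x2 arrive first, x3 joins two
# iterations later, and x4 and x5 arrive just as x2 finishes.
arrivals = {
    0: [Request("x1", 10), Request("x2", 4)],
    2: [Request("x3", 6)],
    5: [Request("x4", 3), Request("x5", 3)],
}
log = continuous_batching(arrivals, capacity=3)
```

With a capacity of three, the log shows the batch starting as {x1, x2}, growing to {x1, x2, x3}, then becoming {x1, x3, x4} once x2 finishes, with x5 held in the queue until a slot frees up.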

[Figure: dynamic batch management in continuous batching]

Updated 2026-05-06


Ch.5 Inference - Foundations of Large Language Models
