Learn Before
Static Batching
Static batching is a scheduling strategy in which, once a batch of requests has been dispatched for execution, its processing cannot be interrupted. The scheduler must wait for the entire batch to finish before it can assemble and dispatch the next one, so requests that complete early sit idle until the slowest request in the batch is done.
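As a minimal illustration, the Python sketch below simulates this policy with a toy in-memory queue and a hypothetical handle_request worker; the names and the sleep-based "work" are illustrative assumptions, not part of any real serving stack.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(req):
    """Hypothetical per-request worker; a real LLM server would run
    prefill + decode here instead of sleeping."""
    time.sleep(req["duration"])
    return f"request {req['id']} finished after {req['duration']}s"

def static_batch_scheduler(pending, batch_size=4):
    """Dispatch requests in fixed groups. The next group is not assembled
    until EVERY request in the current group has finished: the
    static-batching constraint."""
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        while pending:
            batch, pending = pending[:batch_size], pending[batch_size:]
            futures = [pool.submit(handle_request, r) for r in batch]
            for f in futures:  # block until the whole batch completes
                print(f.result())

if __name__ == "__main__":
    durations = [2, 1, 3, 1, 2]  # seconds of simulated work per request
    requests = [{"id": i, "duration": d} for i, d in enumerate(durations)]
    static_batch_scheduler(requests, batch_size=3)
```

Note how result collection blocks on every request in the group: a batch that mixes short and long generations is held open by its longest member, which is exactly the inefficiency the Learn After items below examine.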
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Aggregated Architecture for Prefilling and Decoding
Static Batching
A technology company is optimizing its popular chatbot service, which is powered by a large language model and handles thousands of simultaneous user queries. To manage this high load, their engineers implement a system that waits to collect several user queries and processes them together as a single group in one computational step. Which of the following outcomes is the most direct and significant advantage of this approach? (A toy throughput model follows this list.)
Analyzing LLM Serving Strategies
Efficiency of Sequential vs. Batched Processing
Throughput-Latency Trade-off in LLM Inference
Simultaneous Token Generation in Batched Decoding
Sequence Concatenation in Disaggregated Inference
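Regarding the chatbot question above: grouping queries lets fixed per-forward-pass costs, such as streaming the model's weights from memory, be shared across every request in the group, which raises throughput. The toy model below uses invented cost numbers purely to sketch that amortization.

```python
# Toy cost model (all numbers invented for illustration). Each forward pass
# pays a fixed overhead regardless of batch size, plus a small per-request cost.
STEP_OVERHEAD_MS = 10.0  # e.g., reading model weights from memory
PER_REQUEST_MS = 1.0     # marginal compute per request in the pass

def ms_per_request(batch_size: int) -> float:
    return (STEP_OVERHEAD_MS + PER_REQUEST_MS * batch_size) / batch_size

for b in (1, 4, 16):
    print(f"batch={b:2d} -> {ms_per_request(b):.2f} ms per request")
# batch= 1 -> 11.00 ms per request
# batch= 4 -> 3.50 ms per request
# batch=16 -> 1.62 ms per request
```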
Learn After
Decoding-Prioritized Strategy in Standard Batching
Comparison of Continuous (Prefilling-Prioritized) vs. Standard (Decoding-Prioritized) Batching
An inference server processes user requests in groups. The server's scheduling policy dictates that it must wait for every single request within a group to finish generating its full response before it can begin processing the next group of requests. If a group contains three requests that take 4 seconds, 7 seconds, and 12 seconds to complete respectively, when will the server become available to start processing a new group? (A short worked check follows this list.)
Diagnosing Inference Server Performance Issues
Analyzing Static Batching Inefficiency
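For the three-request timing question above: under this policy the server stays occupied until its slowest request finishes, so the answer reduces to the maximum completion time. A one-line check:

```python
# Times taken from the question above; the group gates on its slowest member.
completion_times = [4, 7, 12]  # seconds
print(max(completion_times))   # 12 -> a new group can start at t = 12 s
```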