Learn Before
Example of Interleaving Prefilling and Decoding in Continuous Batching
Continuous batching demonstrates its efficiency when a new request arrives while an existing batch is already undergoing decoding. For example, after an initial batch of requests (e.g., x1, x2, x3) has completed its prefilling and several decoding steps, a new request (x4) might arrive. The system can then, in the next computational iteration, perform the prefilling for the new request x4 while simultaneously executing another decoding step for the ongoing requests x1, x2, and x3. This concurrent execution of prefilling for new requests and decoding for existing ones is a key feature that maximizes hardware utilization and system throughput.
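To make the interleaving concrete, below is a minimal Python sketch of an iteration-level scheduling loop. It is only an illustration under simplified assumptions: the `Request` class, the `run_iteration` function, and the token counts are invented for this example and do not come from any particular serving framework. In iteration 2, the newly arrived request x4 is prefilled in the same step in which x1, x2, and x3 each decode one more token.

```python
from dataclasses import dataclass
from collections import deque


@dataclass
class Request:
    name: str
    prompt_len: int        # number of prompt tokens to prefill
    max_new_tokens: int    # decoding budget for this request
    generated: int = 0     # decoded tokens so far


def run_iteration(running, waiting):
    """One iteration-level scheduling step: prefill any newly admitted
    requests while running one decoding step for the requests already
    in flight."""
    # Admit waiting requests; their prefill happens in this iteration.
    admitted = []
    while waiting:
        req = waiting.popleft()
        admitted.append(req)
        print(f"  prefill  {req.name} ({req.prompt_len} prompt tokens)")

    # One decoding step for every request that was prefilled earlier.
    for req in running:
        req.generated += 1
        print(f"  decode   {req.name} (token {req.generated}/{req.max_new_tokens})")

    # Newly prefilled requests join the decoding batch from the next iteration.
    running.extend(admitted)
    # Retire finished requests so their slots free up immediately.
    running[:] = [r for r in running if r.generated < r.max_new_tokens]


if __name__ == "__main__":
    # Iteration 0 (not shown): x1, x2, x3 are prefilled and begin decoding.
    running = [Request("x1", 16, 3), Request("x2", 8, 3), Request("x3", 12, 3)]
    waiting = deque()

    for step in range(1, 5):
        if step == 2:                      # x4 arrives mid-stream
            waiting.append(Request("x4", 10, 2))
        print(f"iteration {step}:")
        run_iteration(running, waiting)
```

Running this prints a prefill line for x4 alongside decode lines for x1, x2, and x3 in iteration 2, mirroring the interleaving described above; a real serving engine would additionally manage KV-cache memory and batch size limits.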
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Iteration in Continuous Batching
General Process of Continuous Batching
Overhead of Dynamic Batch Reorganization in Continuous Batching
Memory Fragmentation in LLM Inference
Prefilling-Prioritized Strategy in Continuous Batching
Simple Iteration-level Scheduling
Priority-Based Scheduling in LLM Inference
Custom Priority Policies in LLM Scheduling
Disaggregation of Prefilling and Decoding using Pipelined Engines
Comparison of Continuous (Prefilling-Prioritized) vs. Standard (Decoding-Prioritized) Batching
LLM Inference Scheduling Strategy
An LLM inference server is processing a batch of three long-running requests. In the middle of this process, after several computational steps have already been completed for the initial batch, a new, short request arrives. How would a system implementing continuous batching most likely handle this new request in the next computational step?
An LLM inference system is designed to maximize hardware utilization. Which of the following operational descriptions best illustrates the core principle of continuous batching, distinguishing it from a static batching approach?
Learn After
An LLM inference server is processing a batch of three requests (A, B, C) and has just completed their initial, compute-intensive processing stage. At this moment, a new request (D) arrives. To maximize hardware utilization and overall system throughput, what is the most efficient action for the server to take in the very next iteration?
An LLM inference server that dynamically manages its workload is processing several requests. The following list describes the key events in this process. Arrange these events in the correct chronological order to reflect the most efficient operational flow.
Diagnosing LLM Inference Server Inefficiency