Example of the First Decoding Step in Continuous Batching (Iteration 2)
This diagram illustrates the second iteration of a continuous batching process, which follows the initial prefilling of requests x1 and x2. In this step, the scheduler directs the inference engine to perform a single decoding operation over the entire batch, concurrently generating the first output token for both request x1 and request x2. This shows how the system transitions from the prefilling phase to the iterative decoding phase for a group of requests.
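The batched decoding step described above can be sketched in a few lines of Python. This is only an illustrative model of the scheduling logic: the `Request` class, the `decode_step` helper, and the toy `generate_next_token` stand-in are hypothetical names for this sketch, not the API of any real inference engine.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: str
    prompt_tokens: list                 # tokens already processed during prefill
    output_tokens: list = field(default_factory=list)

def decode_step(batch, generate_next_token):
    """One batched decoding iteration: every request in the batch
    produces exactly one new output token."""
    for req in batch:
        # The model conditions on the full context seen so far.
        context = req.prompt_tokens + req.output_tokens
        req.output_tokens.append(generate_next_token(context))
    return batch

# Two requests whose prefill phase has already completed (iteration 1).
x1 = Request("x1", prompt_tokens=[101, 102, 103])
x2 = Request("x2", prompt_tokens=[201, 202])

# Iteration 2: the first decode step. A toy "model" stands in for the
# real forward pass; here the next token is just the context length.
decode_step([x1, x2], generate_next_token=len)

print(x1.output_tokens, x2.output_tokens)  # each request gained one token
```

In a real engine the loop body would be a single batched forward pass over all active sequences rather than a per-request call, which is what makes continuous batching efficient.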

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Example of the First Decoding Step in Continuous Batching (Iteration 2)
An inference server's scheduler receives two new, independent user requests at the same time. Assuming the system has the capacity to handle both, what is the most accurate description of the scheduler's immediate action and the primary goal of this initial processing step?
Initial Batch Formation
An inference scheduler receives two new, independent requests. Arrange the following events to accurately describe the initial processing step for these requests.
Learn After
Example of the Second Decoding Step in Continuous Batching (Iteration 3)
An inference engine is processing user requests. It has just completed the initial processing for two separate requests, 'Request A' and 'Request B', loading them into memory. Both requests are now ready for the next stage of generation. What is the most likely immediate next action the engine will take to operate efficiently, and what will be its result?
State of Batched Requests After First Generation Step
Inference Engine State Analysis