Example of Pipelined Prefilling and Decoding with Two Engines
This diagram illustrates a pipelined architecture for LLM inference that uses two separate engines to improve efficiency. Engine 1 is dedicated to the prefilling phase and processes an initial batch of requests (e.g., sequences 1-4). Once prefilling is complete, the resulting key-value (KV) cache is transferred to Engine 2, which handles the decoding phase for that batch. The key advantage of this disaggregated approach is that Engine 1 can begin prefilling a new batch of requests (e.g., sequences 5-6) as soon as it finishes the first, while Engine 2 is still decoding the first batch. The prefilling of one batch therefore overlaps with the decoding of another, which keeps both engines busy and raises throughput compared with running both phases back to back on a single engine.
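The overlap described above can be sketched as a small timeline simulation. This is a minimal illustration, not an implementation of any particular inference engine: the `Batch` class, the function names, the per-batch prefill and decode durations, and the `transfer_time` parameter are all hypothetical values chosen for the example, and the model assumes each engine works on one batch at a time.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    name: str
    prefill_time: float  # assumed time units Engine 1 needs to prefill the batch
    decode_time: float   # assumed time units Engine 2 needs to decode the batch


def single_engine_makespan(batches):
    """Baseline: one engine runs prefill then decode for each batch in turn."""
    return sum(b.prefill_time + b.decode_time for b in batches)


def pipelined_schedule(batches, transfer_time=0.0):
    """Two engines: Engine 1 only prefills, Engine 2 only decodes.

    Returns one tuple per batch:
    (name, prefill_start, prefill_end, decode_start, decode_end).
    """
    engine1_free = 0.0  # time at which Engine 1 can accept the next batch
    engine2_free = 0.0  # time at which Engine 2 can accept the next batch
    schedule = []
    for b in batches:
        prefill_start = engine1_free
        prefill_end = prefill_start + b.prefill_time
        # Engine 1 is free for the next batch as soon as prefilling ends.
        engine1_free = prefill_end
        # Decoding needs the transferred KV cache and an idle Engine 2.
        kv_ready = prefill_end + transfer_time
        decode_start = max(kv_ready, engine2_free)
        decode_end = decode_start + b.decode_time
        engine2_free = decode_end
        schedule.append((b.name, prefill_start, prefill_end, decode_start, decode_end))
    return schedule


# Hypothetical durations for the two batches named in the description.
batches = [
    Batch("sequences 1-4", prefill_time=4.0, decode_time=6.0),
    Batch("sequences 5-6", prefill_time=2.0, decode_time=3.0),
]

schedule = pipelined_schedule(batches, transfer_time=1.0)
pipelined_makespan = schedule[-1][4]
sequential_makespan = single_engine_makespan(batches)

for name, ps, pe, ds, de in schedule:
    print(f"{name}: prefill {ps}-{pe} on Engine 1, decode {ds}-{de} on Engine 2")
print(f"single engine: {sequential_makespan}, pipelined: {pipelined_makespan}")
```

With these assumed numbers, Engine 1 prefills sequences 5-6 during time 4-6 while Engine 2 is decoding sequences 1-4 during time 5-11, so the pipelined run finishes at 14 time units instead of 15. The gain grows as more batches flow through the pipeline, and it shrinks as the KV cache transfer becomes more expensive.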
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Related
An LLM inference system is designed with two specialized hardware engines operating in a pipeline. Engine A processes the initial prompts for a batch of user requests to generate their internal state. This state is then passed to Engine B, which handles the step-by-step generation of the response tokens for that same batch. As soon as Engine A finishes with the first batch, it immediately begins processing the initial prompts for a second, new batch of requests while Engine B is still generating tokens for the first batch. What is the primary computational advantage of this two-engine architecture?
Optimizing LLM Inference Throughput
Pipelined Engine Efficiency