1Cademy - Example of Pipelined Prefilling and Decoding with Two Engines

Learn Before

Disaggregation of Prefilling and Decoding using Pipelined Engines

Example

Example of Pipelined Prefilling and Decoding with Two Engines

This diagram illustrates a pipelined architecture for LLM inference that splits the work across two separate engines. Engine 1 is dedicated to the prefilling phase and processes an initial batch of requests (e.g., sequences 1-4). When prefilling completes, the resulting key-value (KV) cache is transferred to Engine 2, which handles the decoding phase for that batch. Because Engine 1 is then free, it can immediately begin prefilling a new batch of requests (e.g., sequences 5-6) while Engine 2 is still decoding the first batch. Overlapping the two phases in this way keeps both engines busy and improves hardware utilization.
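The overlap described above can be made concrete with a small timeline simulation. This is only a sketch under stated assumptions, not a real inference engine: the names `Slot` and `schedule` are hypothetical, every batch is assumed to take the same fixed prefill and decode time, the KV cache transfer is modeled as a simple delay that does not block Engine 1, and the time values are arbitrary units chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """One busy interval of an engine (hypothetical helper type)."""
    engine: str
    batch: tuple
    start: float
    end: float


def schedule(batches, prefill_time, decode_time, transfer_time=0.0):
    """Compute when each batch is prefilled on Engine 1 and decoded on Engine 2.

    Assumptions (not from the source): fixed per-batch durations, and a
    KV cache transfer that delays decoding but does not occupy Engine 1.
    """
    slots = []
    engine1_free = 0.0  # time at which Engine 1 can start the next prefill
    engine2_free = 0.0  # time at which Engine 2 can start the next decode
    for batch in batches:
        # Engine 1 prefills batches back to back.
        p_start = engine1_free
        p_end = p_start + prefill_time
        engine1_free = p_end

        # Decoding needs both the transferred KV cache and a free Engine 2.
        kv_ready = p_end + transfer_time
        d_start = max(kv_ready, engine2_free)
        d_end = d_start + decode_time
        engine2_free = d_end

        slots.append(Slot("engine1/prefill", tuple(batch), p_start, p_end))
        slots.append(Slot("engine2/decode", tuple(batch), d_start, d_end))
    return slots


# Two batches as in the example: sequences 1-4, then sequences 5-6.
slots = schedule([[1, 2, 3, 4], [5, 6]], prefill_time=1.0, decode_time=3.0)
for s in slots:
    print(f"{s.engine:16s} batch={s.batch} {s.start:.1f} -> {s.end:.1f}")
```

With these illustrative durations, the prefill of sequences 5-6 falls entirely inside the interval in which Engine 2 is decoding sequences 1-4, which is the overlap the example describes. The second batch then waits for Engine 2 to become free before its own decoding starts.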

Updated 2025-10-07

References

Reference of Foundations of Large Language Models Course
