Example

Example of Pipelined Prefilling and Decoding with Two Engines

This diagram illustrates a pipelined (disaggregated) architecture for LLM inference that splits the work across two engines. Engine 1 is dedicated to the prefilling phase and processes an initial batch of requests (e.g., sequences 1-4). Once prefilling is complete, the resulting key-value (KV) cache is transferred to Engine 2, which handles the decoding phase for that batch. The key advantage is that Engine 1 can immediately begin prefilling a new batch of requests (e.g., sequences 5-6) while Engine 2 is still decoding the first batch, so the two phases overlap and hardware utilization improves.
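The overlap described above can be sketched as a small timeline simulation. This is a minimal illustration rather than an actual inference engine: the `Batch` class, the function names, and the time units are all hypothetical, KV cache transfer is assumed to take no time, and Engine 2 is assumed to decode one batch at a time.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    name: str
    prefill_time: int  # hypothetical time units spent on Engine 1
    decode_time: int   # hypothetical time units spent on Engine 2


def pipelined_schedule(batches):
    """Return (schedule, makespan) for a two-engine prefill/decode pipeline."""
    engine1_free = 0  # time at which the prefill engine becomes idle
    engine2_free = 0  # time at which the decode engine becomes idle
    schedule = []
    for b in batches:
        # Engine 1 starts the next prefill as soon as it is idle.
        prefill_start = engine1_free
        prefill_end = prefill_start + b.prefill_time
        engine1_free = prefill_end
        # Decoding needs the KV cache (ready at prefill_end) and an idle Engine 2.
        # Assumption: the KV cache transfer itself costs no time.
        decode_start = max(prefill_end, engine2_free)
        decode_end = decode_start + b.decode_time
        engine2_free = decode_end
        schedule.append((b.name, prefill_start, prefill_end, decode_start, decode_end))
    return schedule, engine2_free


def serial_makespan(batches):
    """Total time if a single engine ran every prefill and decode back to back."""
    return sum(b.prefill_time + b.decode_time for b in batches)


# Made-up durations for the two batches mentioned in the text.
batches = [Batch("sequences 1-4", 2, 6), Batch("sequences 5-6", 1, 4)]
schedule, makespan = pipelined_schedule(batches)
```

With these made-up durations, the second batch is prefilled during time 2-3 while Engine 2 is decoding the first batch (time 2-8), and the pipelined makespan is 12 time units versus 13 for a single engine that runs the phases serially. The size of the gain depends entirely on the assumed durations.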

Updated 2025-10-07

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences