Learn Before
Activity (Process)

Sequence Concatenation in Disaggregated Inference

When prefilling and decoding are disaggregated during Large Language Model inference, multiple short sequences can be concatenated into a single long sequence for joint processing. This strategy maximizes the number of tokens processed in a batch, improving the throughput of the prefilling phase while minimizing the need for padding tokens. As a trade-off, this approach introduces additional communication overhead because the Key-Value (KV) cache produced during prefilling must be transferred to the decoding devices. Consequently, achieving optimal performance with this method typically requires a high-bandwidth, low-latency network.
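To make the packing step concrete, here is a minimal PyTorch sketch (not from the original text; the function names `concatenate_for_prefill` and `split_kv_cache` are hypothetical). It packs variable-length requests into one padding-free sequence with cumulative sequence lengths, the bookkeeping that variable-length attention kernels use to keep each request's attention within its own tokens, and then slices the resulting KV cache back into per-request pieces, which is the data that would be transferred to the decoding devices.

```python
import torch

def concatenate_for_prefill(sequences):
    """Pack several short token sequences into one long sequence with no padding.

    Returns the packed token tensor and cumulative sequence lengths, which mark
    the boundary of each original request inside the packed sequence.
    """
    packed = torch.cat(sequences)                      # [total_tokens]
    lengths = torch.tensor([len(s) for s in sequences])
    cu_seqlens = torch.zeros(len(sequences) + 1, dtype=torch.long)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)      # e.g. [0, 5, 8, 15]
    return packed, cu_seqlens

def split_kv_cache(kv_cache, cu_seqlens):
    """Slice the packed KV cache back into per-request caches.

    Each slice would then be sent to the decoding device serving that request,
    which is the source of the extra communication overhead.
    """
    return [kv_cache[:, start:end]                     # [2 (K, V), seq_len, d]
            for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]

if __name__ == "__main__":
    # Three short requests of different lengths, packed without padding.
    seqs = [torch.arange(5), torch.arange(3), torch.arange(7)]
    packed, cu_seqlens = concatenate_for_prefill(seqs)
    print(packed.shape, cu_seqlens.tolist())           # torch.Size([15]) [0, 5, 8, 15]

    # Stand-in for the KV cache the prefill device would produce: [2, total_tokens, d].
    kv_cache = torch.randn(2, packed.shape[0], 64)
    per_request_kv = split_kv_cache(kv_cache, cu_seqlens)
    print([kv.shape[1] for kv in per_request_kv])      # [5, 3, 7]
```

In practice the cumulative-length bookkeeping plays the role that padding masks play in a padded batch, which is why this packing avoids wasted computation on padding tokens while still keeping requests separable for the KV-cache transfer.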



Tags

Foundations of Large Language Models

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences