Learn Before
Sequence Concatenation in Disaggregated Inference
When prefilling and decoding are disaggregated during Large Language Model inference, multiple short sequences can be concatenated into a single long sequence for joint processing. This strategy maximizes the number of tokens processed in a batch, improving the throughput of the prefilling phase while minimizing the need for padding tokens. As a trade-off, this approach introduces additional communication overhead because the Key-Value (KV) cache must be transferred to the decoding devices. Consequently, achieving optimal performance with this method typically requires a high-bandwidth, low-latency network.
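To make the packing step concrete, the sketch below shows one way several tokenized prompts could be concatenated into a single prefill sequence with per-prompt position IDs and boundary offsets, and how the jointly computed KV cache could then be split back into per-request segments for transfer to the decoding devices. This is a minimal illustrative sketch, not the API of any particular serving system; the function names (`pack_sequences`, `split_kv_cache`) and the data layout are assumptions made for the example.

```python
# Minimal sketch of sequence concatenation for disaggregated prefilling.
# Function names and data layout are illustrative assumptions.

def pack_sequences(prompts):
    """Concatenate tokenized prompts into one long sequence without padding.

    Returns the packed token ids, per-token position ids that restart at 0
    for each original prompt, and cumulative boundaries (so attention can be
    restricted to tokens belonging to the same original prompt).
    """
    packed_tokens = []
    position_ids = []
    cu_seqlens = [0]  # cumulative lengths marking where each prompt starts/ends
    for tokens in prompts:
        packed_tokens.extend(tokens)
        position_ids.extend(range(len(tokens)))  # positions restart per prompt
        cu_seqlens.append(cu_seqlens[-1] + len(tokens))
    return packed_tokens, position_ids, cu_seqlens


def split_kv_cache(kv_cache, cu_seqlens):
    """Slice the jointly computed KV cache back into per-prompt segments.

    Each segment would then be transferred over the network to the decoding
    device that handles the corresponding request.
    """
    return [kv_cache[start:end]
            for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]


if __name__ == "__main__":
    # Three short tokenized prompts of different lengths (no padding needed).
    prompts = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
    tokens, positions, cu_seqlens = pack_sequences(prompts)
    print(tokens)      # [11, 12, 13, 21, 22, 31, 32, 33, 34]
    print(positions)   # [0, 1, 2, 0, 1, 0, 1, 2, 3]
    print(cu_seqlens)  # [0, 3, 5, 9]

    # Pretend the prefill device produced one KV entry per token; split it so
    # each request's cache can be sent to its decoding device separately.
    fake_kv = [f"kv_{t}" for t in tokens]
    print(split_kv_cache(fake_kv, cu_seqlens))
```

The per-prompt position IDs and the `cu_seqlens` boundaries are what let attention be computed independently for each packed prompt, which is why no padding is required; the per-request KV segments produced at the end are exactly the data whose network transfer creates the communication overhead described above.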
Tags
Foundations of Large Language Models
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Aggregated Architecture for Prefilling and Decoding
Static Batching
A technology company is optimizing its popular chatbot service, which is powered by a large language model and handles thousands of simultaneous user queries. To manage this high load, their engineers implement a system that waits to collect several user queries and processes them together as a single group in one computational step. Which of the following outcomes is the most direct and significant advantage of this approach?
Analyzing LLM Serving Strategies
Efficiency of Sequential vs. Batched Processing
Throughput-Latency Trade-off in LLM Inference
Simultaneous Token Generation in Batched Decoding
Sequence Concatenation in Disaggregated Inference