Learn Before
Sequence Concatenation in Disaggregated Inference
When prefilling and decoding are disaggregated during Large Language Model inference, multiple short sequences can be concatenated into a single long sequence for joint processing. This strategy maximizes the number of tokens processed in a batch, improving the throughput of the prefilling phase while minimizing the need for padding tokens. As a trade-off, this approach introduces additional communication overhead because the Key-Value (KV) cache must be transferred to the decoding devices. Consequently, achieving optimal performance with this method typically requires a high-bandwidth, low-latency network.
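To make the packing step concrete, the sketch below shows one way several tokenized prompts could be concatenated into a single prefill sequence with per-prompt position IDs and boundary offsets, and how the jointly computed KV cache could then be split back into per-request segments for transfer to the decoding devices. This is a minimal illustrative sketch, not the API of any particular serving system; the function names (`pack_sequences`, `split_kv_cache`) and the data layout are assumptions made for the example.

```python
# Minimal sketch of sequence concatenation for disaggregated prefilling.
# Function names and data layout are illustrative assumptions.

def pack_sequences(prompts):
    """Concatenate tokenized prompts into one long sequence without padding.

    Returns the packed token ids, per-token position ids that restart at 0
    for each original prompt, and cumulative boundaries (so attention can be
    restricted to tokens belonging to the same original prompt).
    """
    packed_tokens = []
    position_ids = []
    cu_seqlens = [0]  # cumulative lengths marking where each prompt starts/ends
    for tokens in prompts:
        packed_tokens.extend(tokens)
        position_ids.extend(range(len(tokens)))  # positions restart per prompt
        cu_seqlens.append(cu_seqlens[-1] + len(tokens))
    return packed_tokens, position_ids, cu_seqlens


def split_kv_cache(kv_cache, cu_seqlens):
    """Slice the jointly computed KV cache back into per-prompt segments.

    Each segment would then be transferred over the network to the decoding
    device that handles the corresponding request.
    """
    return [kv_cache[start:end]
            for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]


if __name__ == "__main__":
    # Three short tokenized prompts of different lengths (no padding needed).
    prompts = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
    tokens, positions, cu_seqlens = pack_sequences(prompts)
    print(tokens)      # [11, 12, 13, 21, 22, 31, 32, 33, 34]
    print(positions)   # [0, 1, 2, 0, 1, 0, 1, 2, 3]
    print(cu_seqlens)  # [0, 3, 5, 9]

    # Pretend the prefill device produced one KV entry per token; split it so
    # each request's cache can be sent to its decoding device separately.
    fake_kv = [f"kv_{t}" for t in tokens]
    print(split_kv_cache(fake_kv, cu_seqlens))
```

The per-prompt position IDs and the `cu_seqlens` boundaries are what let attention be computed independently for each packed prompt, which is why no padding is required; the per-request KV segments produced at the end are exactly the data whose network transfer creates the communication overhead described above.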
Tags
Foundations of Large Language Models
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Aggregated Architecture for Prefilling and Decoding
Static Batching
A technology company is optimizing its popular chatbot service, which is powered by a large language model and handles thousands of simultaneous user queries. To manage this high load, their engineers implement a system that waits to collect several user queries and processes them together as a single group in one computational step. Which of the following outcomes is the most direct and significant advantage of this approach?
Analyzing LLM Serving Strategies
Efficiency of Sequential vs. Batched Processing
Throughput-Latency Trade-off in LLM Inference
Simultaneous Token Generation in Batched Decoding
Sequence Concatenation in Disaggregated Inference