Activity (Process)

Grouping User Requests by Sequence Length

To reduce the number of padding tokens and improve device utilization during Large Language Model inference, incoming user requests collected over a short period can be grouped into buckets based on their sequence lengths. By filling a batch exclusively with sequences from the same bucket, the system ensures that the batched sequences have similar lengths, so each batch need only be padded to the longest sequence it contains, minimizing wasted computation.
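The grouping described above can be sketched as follows. This is a minimal illustration, not a production scheduler: the bucket width, batch size, and pad token are assumed parameters, and requests are represented simply as lists of token IDs.

```python
from collections import defaultdict

def bucket_requests(requests, bucket_width=64):
    """Group token sequences into buckets of similar length.

    A sequence of length L lands in bucket L // bucket_width, so all
    sequences in one bucket differ in length by less than bucket_width.
    """
    buckets = defaultdict(list)
    for seq in requests:
        buckets[len(seq) // bucket_width].append(seq)
    return buckets

def make_batches(bucket, batch_size, pad_token=0):
    """Form fixed-size batches from one bucket, padding each batch
    only to the longest sequence it contains."""
    batches = []
    for i in range(0, len(bucket), batch_size):
        group = bucket[i:i + batch_size]
        max_len = max(len(s) for s in group)
        # Similar lengths within a bucket keep this padding small.
        batches.append([s + [pad_token] * (max_len - len(s)) for s in group])
    return batches
```

With a bucket width of 64, for example, a 10-token and a 12-token request fall into the same bucket and are padded only to length 12, rather than to the length of the longest request in the whole arrival window.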


Updated 2026-05-05


Tags

Foundations of Large Language Models

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences