Grouping User Requests by Sequence Length
To reduce the number of padding tokens and improve device utilization during Large Language Model inference, incoming user requests collected over a short period can be grouped into buckets based on their sequence lengths. By filling a batch exclusively with sequences from the same bucket, the system ensures that the batched sequences have similar lengths, thereby minimizing wasted computational resources.
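The bucketing idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production scheduler: the function names, the bucket width of 64 tokens, and the batch size of 4 are all illustrative assumptions. Requests are keyed by `length // bucket_size`, so every batch is filled exclusively from one bucket and sequences in a batch differ in length by less than the bucket width.

```python
from collections import defaultdict

def bucket_requests(requests, bucket_size=64):
    """Group requests into buckets by sequence-length range.

    requests: list of (request_id, num_tokens) pairs collected over
    a short time window. Requests whose lengths fall in the same
    bucket_size-wide range share a bucket.
    """
    buckets = defaultdict(list)
    for req_id, length in requests:
        buckets[length // bucket_size].append((req_id, length))
    return buckets

def make_batches(buckets, batch_size=4):
    """Fill each batch exclusively with sequences from one bucket."""
    batches = []
    for key in sorted(buckets):
        reqs = buckets[key]
        for i in range(0, len(reqs), batch_size):
            batches.append(reqs[i:i + batch_size])
    return batches

# Hypothetical requests with lengths ranging from short to long.
requests = [("r1", 12), ("r2", 480), ("r3", 20),
            ("r4", 455), ("r5", 60), ("r6", 70)]
batches = make_batches(bucket_requests(requests))

for batch in batches:
    lengths = [n for _, n in batch]
    # Padding waste = tokens added to pad every sequence up to the
    # batch maximum; small here because lengths are similar.
    waste = max(lengths) * len(lengths) - sum(lengths)
    print(batch, "padding tokens:", waste)
```

Because each batch draws from a single bucket, the per-batch padding overhead is bounded by the bucket width; randomly mixing a 12-token request with a 480-token one in the same batch would instead force hundreds of padding tokens per short sequence.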
Tags
Ch.5 Inference - Foundations of Large Language Models
Computing Sciences
Related
Example of Efficient Batching with Similar Sequence Lengths
An engineer is processing a large dataset where text sequences vary in length from 5 tokens to 500 tokens. The engineer creates batches by randomly selecting sequences from the entire dataset. Which statement best evaluates the impact of this strategy on computational efficiency?
Optimizing Batch Processing for a Summarization Service
A machine learning model is processing text data. The efficiency of this process depends on how sequences are grouped into batches for computation. Evaluate the following three batches, each containing three sequences with the specified lengths, and match each batch to its relative computational efficiency.