Latency in Batched vs. Single Sequence Processing
Imagine two separate requests are sent to a large language model. Request A contains only a single, short sentence to be completed. Request B is a batch containing two items: the same short sentence from Request A, and a much longer paragraph that also needs to be completed. Explain why the user who sent Request A will receive their completed sentence back faster than the user who sent Request B, even though the same short sentence was processed in both cases.
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A development team is using a large language model for two different tasks. Task A requires generating a response to a user's query as quickly as possible to maintain a conversational flow. Task B involves processing a large collection of documents where the total time to complete all documents is the main concern, but the time for any single document is less critical. To achieve the fastest possible response time for an individual query in Task A, which processing approach should be used and why?
Latency in Batched vs. Single Sequence Processing
When a system processes a single input sequence at a time, the latency for that request is minimized: no decoding steps are shared with other work, and the result is returned the moment the sequence finishes. In a static batch, by contrast, all sequences decode in lockstep, one token per step, and the batch's results are typically returned only once the longest sequence has finished generating. The short sentence in Request B therefore completes its tokens early but sits waiting for the long paragraph, which is why the user who sent Request A receives the same completion back sooner.
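A minimal sketch of this effect, assuming a constant per-token decoding step time and simple static batching (where the batch returns only when its longest sequence finishes); the step time and token counts below are hypothetical:

```python
# Toy latency model for static batching: every sequence in a batch decodes
# in lockstep, one token per step, and results are returned only when the
# longest sequence in the batch has finished.

STEP_MS = 20  # assumed (hypothetical) time per decoding step, in ms


def single_request_latency(tokens_to_generate: int) -> int:
    """Latency when a sequence is decoded on its own."""
    return tokens_to_generate * STEP_MS


def static_batch_latency(batch_token_counts: list[int]) -> int:
    """Latency for every request in a static batch: gated by the longest."""
    return max(batch_token_counts) * STEP_MS


short, long_para = 10, 200  # tokens for the short sentence vs. the paragraph

# Request A: the short sentence decoded alone.
latency_a = single_request_latency(short)               # 10 * 20 = 200 ms

# Request B: the same sentence batched with the long paragraph.
latency_b = static_batch_latency([short, long_para])    # 200 * 20 = 4000 ms

print(latency_a, latency_b)
```

Even though the sentence itself needs the same 10 decoding steps in both cases, the batched copy inherits the paragraph's 200-step completion time.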