Example of Minimal Latency with a Single Sequence
An illustrative case for understanding latency is processing a single input sequence. With a batch size of one, the result is returned as soon as generation completes: the request's latency is simply its own generation time, with no added waiting or computational overhead caused by other sequences in a batch. This represents the lowest possible latency for an individual request.
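As a minimal sketch of this idea (the per-step timing is an illustrative assumption, not a measurement of any real serving system), the latency of a lone request reduces to its own generation time:

```python
STEP_MS = 20  # assumed milliseconds per decoding step (illustrative only)


def single_sequence_latency_ms(num_tokens: int) -> int:
    """Latency for one request served alone (batch size 1):
    just its own generation time, with no batch-induced waiting."""
    return num_tokens * STEP_MS


print(single_sequence_latency_ms(100))  # 2000 ms for a 100-token response
```

Because nothing else shares the batch, shortening the response directly shortens the latency; there is no lower bound imposed by other requests.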
Tags
Ch.5 Inference - Foundations of Large Language Models
Latency in Batched vs. Single Sequence Processing
When a system processes a single input sequence at a time, the latency for that request is minimized because there is no added delay from waiting for other sequences in a batch to complete their generation.
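The contrast above can be illustrated with a toy static-batching model. The numbers, and the simplifying assumption that results are returned only when the whole batch finishes decoding, are illustrative choices for this sketch, not properties of any particular serving system:

```python
STEP_MS = 20  # assumed milliseconds per decoding step (illustrative only)


def latency_alone_ms(tokens: int) -> int:
    """Batch size 1: latency is the request's own generation time."""
    return tokens * STEP_MS


def latency_in_static_batch_ms(batch_lengths: list[int]) -> int:
    """Naive static batching: the batch decodes until its longest
    sequence finishes, so every request waits that long."""
    return max(batch_lengths) * STEP_MS


lengths = [30, 120, 60, 45]  # generation lengths of four batched requests
print(latency_alone_ms(30))              # 600 ms when served alone
print(latency_in_static_batch_ms(lengths))  # 2400 ms for every request in the batch
```

In this model the 30-token request takes four times longer inside the batch than alone, because its result is held until the 120-token sequence completes; serving it by itself removes that waiting entirely.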