Optimizing LLM Serving Configuration
Analyze the two deployment scenarios described below. For each scenario, recommend whether a larger or smaller request batch size best optimizes performance, and justify your recommendation by explaining the resulting trade-off between overall processing efficiency (throughput) and the time it takes to get a response for a single request (latency).
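The trade-off is easiest to see with a toy cost model: each forward pass pays a fixed cost (weight loads, kernel launches) plus a small per-sequence marginal cost, so larger batches amortize the fixed cost and raise throughput, while every request in a batch waits at least the whole pass time. The sketch below is a minimal simulation under assumed, illustrative constants (FIXED_COST, MARGINAL_COST); it is not a benchmark of any real system.

```python
# Toy cost model for one forward pass of an LLM serving step.
# FIXED_COST and MARGINAL_COST are illustrative assumptions, not measurements.

FIXED_COST = 0.050     # seconds per forward pass, independent of batch size
MARGINAL_COST = 0.002  # additional seconds per sequence in the batch

def step_time(batch_size: int) -> float:
    """Time for one forward pass over batch_size sequences."""
    return FIXED_COST + MARGINAL_COST * batch_size

for b in (1, 8, 32):
    t = step_time(b)
    throughput = b / t  # sequences completed per second
    # Every request in the batch waits for the whole pass, so its latency
    # is at least t (plus any time spent waiting for the batch to fill).
    print(f"batch={b:>2}  pass={t * 1000:5.1f} ms  "
          f"throughput={throughput:6.1f} seq/s  latency >= {t * 1000:5.1f} ms")
```

Under these assumed constants, moving from batch size 1 to 32 multiplies throughput roughly 14x while roughly doubling per-request latency; weighing that exchange for each scenario is the substance of the question.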
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Optimizing LLM Serving Configuration
An engineering team is deploying a large language model to power a real-time, interactive customer service chatbot. The top priority is ensuring that users experience minimal delay between sending a message and receiving a response. Which batch size strategy should the team implement to best achieve this goal? (A configuration sketch illustrating this choice appears after this list.)
Example of Throughput Gain with Increased Batch Size
Example of Minimal Latency with a Single Sequence
Match each performance characteristic of a language model serving system with the batch size strategy that is its primary cause.
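For the chatbot scenario above, the latency-first choice can be summarized as a configuration sketch. ServingConfig and its fields are hypothetical names used only for illustration; real serving frameworks expose analogous knobs (a cap on batch size and a timeout governing how long to wait while a batch fills).

```python
from dataclasses import dataclass

@dataclass
class ServingConfig:
    """Hypothetical batching knobs; real frameworks expose analogous settings."""
    max_batch_size: int      # cap on sequences processed in one forward pass
    batch_timeout_ms: float  # how long to wait for more requests before running

# Latency-first (interactive chatbot): run each request almost immediately
# rather than holding it back to fill a large batch. Time to first response
# stays minimal, at the cost of GPU utilization and overall throughput.
interactive = ServingConfig(max_batch_size=1, batch_timeout_ms=0.0)

# Throughput-first (for contrast, e.g. offline batch processing): wait
# briefly to accumulate larger batches and amortize the fixed per-pass cost.
offline = ServingConfig(max_batch_size=64, batch_timeout_ms=50.0)
```

For the matching exercise, the pairings implied by the same trade-off can be written out as a plain mapping; the labels below are illustrative phrasings, not the exact items from the exercise.

```python
# Performance characteristic -> batch size strategy that is its primary cause.
primary_cause = {
    "high overall throughput":  "large batches (fixed per-pass cost amortized)",
    "high GPU utilization":     "large batches (more parallel work per pass)",
    "low per-request latency":  "small batches (no wait for a batch to fill)",
    "fast time to first token": "small batches (each request runs immediately)",
}
```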