Learn Before
The Core Trade-off in LLM Serving
In the context of a system serving many users with a large language model, explain why a strategy designed to maximize the total number of requests processed per minute often results in a longer wait time for each individual user. Describe the core conflict between the two performance metrics involved.
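The conflict the question points at can be sketched numerically. The toy model below (all constants are hypothetical, chosen only for illustration) assumes a batch of B requests costs a fixed per-batch overhead plus a small marginal cost per request, and that requests arrive at a steady rate, so a request may sit in a queue while its batch fills. Throughput keeps improving with batch size, while the average per-request wait keeps getting worse:

```python
# Toy model of the throughput-latency trade-off in batched LLM serving.
# All numbers below are illustrative assumptions, not measurements.

FIXED_COST = 1.0     # assumed seconds of per-batch overhead (scheduling, weights)
PER_REQUEST = 0.05   # assumed marginal seconds each extra request adds
ARRIVAL_GAP = 0.1    # assumed seconds between successive request arrivals

def batch_time(batch_size: int) -> float:
    """Wall-clock seconds to process one batch of `batch_size` requests."""
    return FIXED_COST + PER_REQUEST * batch_size

def throughput(batch_size: int) -> float:
    """Requests completed per second of accelerator time."""
    return batch_size / batch_time(batch_size)

def avg_latency(batch_size: int) -> float:
    """Average per-request latency: the mean wait for the batch to fill
    (the first arrival waits longest, the last not at all) plus the
    batch's compute time, which every request in the batch shares."""
    avg_fill_wait = ARRIVAL_GAP * (batch_size - 1) / 2
    return avg_fill_wait + batch_time(batch_size)

for b in (1, 8, 32, 128):
    print(f"B={b:3d}  throughput={throughput(b):6.2f} req/s  "
          f"avg latency={avg_latency(b):6.2f} s")
```

Under these assumptions, going from B=1 to B=128 raises throughput roughly eighteen-fold while multiplying the average latency by more than ten: the batch amortizes fixed costs across requests (good for aggregate capacity), but each request now pays for the queueing and compute time of its batch-mates (bad for the individual user).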
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Comprehension in Revised Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Impact of Batch Size on the Throughput-Latency Trade-off
An engineering team is optimizing a system that serves a large language model to multiple users. To maximize the number of requests processed per hour, they decide to group incoming requests into large batches before sending them to the hardware for processing. This approach significantly increases the system's overall processing capacity. For which of the following applications would this optimization strategy be most detrimental to the user experience?
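The application that suffers most is the one where a human is waiting on the first token, e.g. interactive chat. A minimal sketch (arrival rate and prefill time are assumed values, not from the question) shows how the time-to-first-token grows with the batch size the scheduler waits to fill:

```python
# Hypothetical sketch: why large batches hurt interactive apps most.
# The first request in a batch must wait for the rest of the batch to
# arrive before any computation starts.

ARRIVAL_RATE = 5.0   # assumed requests/second reaching the server
PREFILL_TIME = 0.4   # assumed seconds to run a prompt through the model

def worst_case_ttft(batch_size: int) -> float:
    """Worst-case time-to-first-token: the earliest arrival waits for the
    remaining batch_size - 1 requests, then for the prefill pass."""
    fill_wait = (batch_size - 1) / ARRIVAL_RATE
    return fill_wait + PREFILL_TIME

for b in (1, 16, 64):
    print(f"batch={b:2d}  worst-case TTFT={worst_case_ttft(b):5.2f} s")
```

With these numbers, a batch of 64 makes the unluckiest chat user stare at a blank screen for around 13 seconds before anything appears, whereas an offline workload (say, nightly document summarization) is indifferent to that delay and benefits fully from the higher throughput.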
Optimizing LLM Serving for Different Applications
The Core Trade-off in LLM Serving