Learn Before
LLM Serving System Design Trade-offs
A financial services company wants to deploy a large language model to power two distinct applications:
- A real-time, customer-facing chatbot that must provide instant answers to user queries.
- An internal batch processing tool that analyzes thousands of financial reports overnight to generate summaries for the next business day.
The engineering team proposes a single, unified serving architecture that relies heavily on dynamic batching to maximize hardware utilization and reduce operational costs. In this system, incoming requests are grouped and processed together, which improves overall throughput but can add a small delay to individual requests while they wait for a batch to form.
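The batching behavior described above can be illustrated with a minimal sketch (illustrative only, not any specific serving framework's API): a dynamic batcher collects incoming requests until either the batch is full or the oldest request has waited past a latency bound, then dispatches them together. The `max_wait_s` knob is exactly the throughput-versus-latency trade-off the proposal hinges on.

```python
import time
from collections import deque

class DynamicBatcher:
    """Hypothetical dynamic batcher: groups requests to raise throughput,
    bounding the extra latency any single request can accrue."""

    def __init__(self, max_batch_size=8, max_wait_s=0.05):
        self.max_batch_size = max_batch_size  # cap on requests per batch
        self.max_wait_s = max_wait_s          # latency bound for the oldest request
        self.queue = deque()                  # (request, arrival_time) pairs

    def submit(self, request):
        self.queue.append((request, time.monotonic()))

    def maybe_form_batch(self):
        """Return a batch if it is full, or if the oldest request has
        waited at least max_wait_s; otherwise return None."""
        if not self.queue:
            return None
        oldest_wait = time.monotonic() - self.queue[0][1]
        if len(self.queue) >= self.max_batch_size or oldest_wait >= self.max_wait_s:
            n = min(self.max_batch_size, len(self.queue))
            return [self.queue.popleft()[0] for _ in range(n)]
        return None

batcher = DynamicBatcher(max_batch_size=4, max_wait_s=0.05)
for i in range(6):
    batcher.submit(f"req-{i}")

first = batcher.maybe_form_batch()   # full batch forms immediately
print(first)                         # ['req-0', 'req-1', 'req-2', 'req-3']

time.sleep(0.06)                     # remaining requests age past max_wait_s
second = batcher.maybe_form_batch()  # partial batch dispatched to bound latency
print(second)                        # ['req-4', 'req-5']
```

Note how the chatbot and the overnight batch tool would want very different settings here: a tiny `max_wait_s` (or none at all) for interactive traffic, versus a large batch size and generous wait for offline summarization.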
Based on your understanding of the components and complexities of a model serving system, critique the team's proposal to use this single architecture for both applications. Is this approach optimal? Justify your conclusion by evaluating the suitability of the proposed architecture for each specific use case.
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Examples of Open-Source LLM Serving Systems
LLM Serving System Design Trade-offs
Deconstructing the Complexity of LLM Serving Systems
A team is building a high-quality serving system for a new large language model. Match each specific engineering challenge with the primary area of system complexity it represents.