Case Study

LLM Serving System Design Trade-offs

A financial services company wants to deploy a large language model to power two distinct applications:

  1. A real-time, customer-facing chatbot that must provide instant answers to user queries.
  2. An internal batch processing tool that analyzes thousands of financial reports overnight to generate summaries for the next business day.

The engineering team proposes a single, unified serving architecture that relies heavily on dynamic batching to maximize hardware utilization and reduce operational costs. In this system, incoming requests are grouped and processed simultaneously, which improves overall throughput but adds a small delay to individual requests while they wait for a batch to form.
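To make the trade-off concrete, here is a minimal, simplified sketch of a dynamic batcher. All names (`DynamicBatcher`, `max_batch_size`, `max_wait_s`) are hypothetical and not from the case study; real serving systems such as production inference servers implement far more sophisticated scheduling, but the core latency-vs-throughput tension is visible even in this toy version:

```python
from collections import deque


class DynamicBatcher:
    """Illustrative sketch of dynamic batching (hypothetical API).

    A batch is dispatched when either `max_batch_size` requests have
    arrived, or `max_wait_s` seconds have passed since the first queued
    request. A larger batch size or wait window raises throughput but
    adds latency to each individual request.
    """

    def __init__(self, max_batch_size=8, max_wait_s=0.05):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = deque()
        self.first_enqueue_time = None

    def submit(self, request, now):
        """Enqueue a request at time `now`; return a batch if one is ready."""
        if not self.queue:
            self.first_enqueue_time = now
        self.queue.append(request)
        return self.maybe_dispatch(now)

    def maybe_dispatch(self, now):
        """Dispatch if the batch is full or the oldest request has waited too long."""
        full = len(self.queue) >= self.max_batch_size
        timed_out = bool(self.queue) and (now - self.first_enqueue_time) >= self.max_wait_s
        if full or timed_out:
            batch = list(self.queue)
            self.queue.clear()
            self.first_enqueue_time = None
            return batch  # would be run through the model as one forward pass
        return None
```

For the chatbot, a lightly loaded batcher like this forces a lone request to sit in the queue until `max_wait_s` expires, which is exactly the latency the real-time use case cannot afford; for the overnight batch tool, that same wait is irrelevant and the larger batches are pure throughput gain.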

Based on your understanding of the components and complexities of a model serving system, critique the team's proposal to use this single architecture for both applications. Is this approach optimal? Justify your conclusion by evaluating the suitability of the proposed architecture for each specific use case.


Updated 2025-10-01

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science