Case Study

Optimizing Chatbot Latency

A development team wants to reduce the response time of their AI chatbot, which is powered by a large, high-quality language model. They decide to use speculative decoding: a smaller, faster 'draft' model proposes several candidate tokens at once, and the large target model then verifies them in a single pass, accepting the longest prefix that matches what it would have generated itself.
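The verification step described above can be sketched as a toy simulation. This is a minimal sketch, not the team's implementation: it assumes each draft token is accepted independently with a fixed probability (`accept_prob`), and that the target model contributes one token of its own per step, as in standard speculative decoding.

```python
import random

random.seed(0)

def speculative_step(k, accept_prob):
    """Simulate one speculative-decoding step: the draft model proposes k
    tokens and the target model accepts a prefix of them. Returns the
    number of tokens produced this step (accepted prefix plus the one
    token the target model always emits itself)."""
    accepted = 0
    for _ in range(k):
        if random.random() < accept_prob:
            accepted += 1
        else:
            break  # first rejection discards the rest of the draft
    return accepted + 1  # target emits one corrected/extra token

# Average tokens produced per step rises with the acceptance rate:
mean_tokens = sum(speculative_step(4, 0.8) for _ in range(10_000)) / 10_000
print(f"avg tokens per step at 80% acceptance: {mean_tokens:.2f}")
```

The key observation is that a higher acceptance rate lets each expensive target-model pass yield more tokens, which is exactly the quantity the two draft models trade against raw speed.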

They are evaluating two draft models:

  • Draft Model A: Extremely fast, but its suggestions are often incorrect, leading to a low acceptance rate by the main model.
  • Draft Model B: Slower than Model A, but its suggestions are more aligned with the main model, resulting in a high acceptance rate.

Which draft model should the team choose to achieve the lowest overall response time? Justify your decision by explaining the critical trade-off involved.
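One way to reason about the trade-off is a simple expected-latency model. The numbers below are hypothetical assumptions chosen for illustration, not measurements: a step costs `k` draft forward passes plus one target verification pass, and yields on average `(1 - a^(k+1)) / (1 - a)` tokens for acceptance rate `a` (geometric acceptance of the draft prefix, plus the one token the target model always contributes).

```python
def latency_per_token(t_draft, t_target, k, accept_rate):
    """Expected wall-clock time per generated token under speculative
    decoding: (k draft passes + 1 target pass) divided by the expected
    number of tokens produced per step."""
    expected_tokens = (1 - accept_rate ** (k + 1)) / (1 - accept_rate)
    step_time = k * t_draft + t_target
    return step_time / expected_tokens

# Hypothetical timings (assumptions, not benchmarks):
k = 4            # draft tokens proposed per step
t_target = 40.0  # ms per target-model forward pass
model_a = latency_per_token(t_draft=1.0, t_target=t_target, k=k, accept_rate=0.3)
model_b = latency_per_token(t_draft=4.0, t_target=t_target, k=k, accept_rate=0.8)
print(f"Model A: {model_a:.1f} ms/token, Model B: {model_b:.1f} ms/token")
```

Under these assumed numbers, Model B's higher acceptance rate more than compensates for its slower drafting, because the target model's verification pass dominates the step cost and amortizes over more accepted tokens. The conclusion can flip if the draft model becomes slow enough that drafting itself dominates, which is the critical trade-off the question asks about.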

Updated 2025-10-05

Tags

Ch.5 Inference - Foundations of Large Language Models
