Optimizing Chatbot Latency
A development team wants to reduce the response time of their AI chatbot, which is powered by a large, high-quality language model. They decide to use speculative decoding: a smaller, faster 'draft' model proposes several candidate tokens at once, which the main large model then verifies in a single pass.
They are evaluating two draft models:
- Draft Model A: Extremely fast, but its suggestions are often incorrect, leading to a low acceptance rate by the main model.
- Draft Model B: Slower than Model A, but its suggestions are more aligned with the main model, resulting in a high acceptance rate.
Which draft model should the team choose to achieve the lowest overall response time? Justify your decision by explaining the critical trade-off involved.
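The trade-off can be sketched with a simplified latency model: one speculation cycle costs the draft time plus one verification pass, and yields the accepted draft tokens plus one token the main model emits itself. All numbers below are hypothetical illustrations, not values given in the question.

```python
def latency_per_token_ms(draft_ms, verify_ms, expected_accepted):
    """Approximate latency per generated token for one speculation cycle.

    A cycle costs draft_ms + verify_ms and produces expected_accepted
    draft tokens plus 1 token from the main model's verification pass.
    """
    return (draft_ms + verify_ms) / (expected_accepted + 1)

# Hypothetical figures: Model A drafts in 5 ms but averages ~0.5 accepted
# tokens; Model B drafts in 15 ms but averages ~3.5. Verification ~30 ms.
a = latency_per_token_ms(5, 30, 0.5)    # fast draft, low acceptance
b = latency_per_token_ms(15, 30, 3.5)   # slower draft, high acceptance
print(f"A: {a:.1f} ms/token, B: {b:.1f} ms/token")
```

Under these assumed numbers, Model B's higher acceptance rate more than pays for its slower drafting, because each expensive verification pass yields more accepted tokens.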
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An engineer is optimizing a text generation system that uses a large, powerful model for final output. To speed up the process, they are testing two different smaller 'draft' models to propose sequences of tokens for the large model to verify.
- Draft Model X: Generates 5 candidate tokens in 10ms. On average, the large model accepts only 1 of these 5 tokens.
- Draft Model Y: Generates 5 candidate tokens in 20ms. On average, the large model accepts 4 of these 5 tokens.
Assuming the verification step by the large model takes a constant amount of time regardless of which draft model is used, which statement best analyzes the likely overall performance of the system?
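The question's figures can be plugged into a throughput comparison. The verification cost is not specified, so a constant 30 ms per pass is assumed here purely for illustration; the 1-token bonus reflects the token the large model emits on each verification pass.

```python
def tokens_per_ms(draft_ms, expected_accepted, verify_ms, bonus=1):
    # One speculation cycle: draft 5 candidates in draft_ms, then one
    # verification pass (verify_ms) that accepts expected_accepted of
    # them and contributes `bonus` token(s) of its own.
    return (expected_accepted + bonus) / (draft_ms + verify_ms)

VERIFY_MS = 30  # assumed constant verification cost (not given in the question)

x = tokens_per_ms(10, 1, VERIFY_MS)  # Draft Model X: 10 ms, 1 of 5 accepted
y = tokens_per_ms(20, 4, VERIFY_MS)  # Draft Model Y: 20 ms, 4 of 5 accepted
print(f"X: {x:.3f} tokens/ms, Y: {y:.3f} tokens/ms")
```

With this assumed verification cost, Model Y roughly doubles Model X's throughput despite drafting twice as slowly, since the result hinges on accepted tokens per verification pass rather than raw drafting speed.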
Optimizing Chatbot Latency
Draft Model Selection Rationale