An engineering team is optimizing a large-scale text generation service that processes long user prompts by breaking them into sequential segments (chunked prefill). The team observes that while the service sustains a high volume of concurrent requests (high throughput), individual users complain about a noticeable delay before the first word of a response appears (high latency). The processing time for each segment is currently much longer than the time required to generate a single output word. Which of the following actions is the most effective first step to address the high latency issue?
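The trade-off behind this question can be sketched numerically. In chunked-prefill scheduling, a running request's decode step can be interleaved after every prefill chunk of another request, so the worst-case gap between a user's output words is roughly one chunk's processing time plus one decode step. A minimal back-of-the-envelope sketch (the function name and the per-token timings are illustrative assumptions, not measurements from any real system):

```python
def max_token_gap_ms(chunk_size, prefill_ms_per_token=0.5, decode_step_ms=10.0):
    """Hypothetical worst-case gap (ms) between consecutive output tokens
    of a running request while another request's prompt is prefilled in
    chunks of `chunk_size` tokens. Timings are assumed for illustration."""
    # One full prefill chunk must finish before the next decode step runs.
    return chunk_size * prefill_ms_per_token + decode_step_ms

# Processing a 4096-token prompt as one segment stalls decodes for ~2 s:
print(max_token_gap_ms(4096))  # 2058.0
# Shrinking the segment to 256 tokens cuts the stall to ~138 ms:
print(max_token_gap_ms(256))   # 138.0
```

Under these assumed numbers, reducing the segment size shrinks the per-token stall by over an order of magnitude while the total prefill work (and hence throughput) stays the same, which is why segment size is the first knob to turn for the latency complaint described above.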
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Inference Service Performance Tuning
Performance Tuning for Sequential Input Processing