A company deploys a large language model for a public-facing Q&A service, distributing inference requests across a cluster of identical GPUs. System monitoring reveals that overall GPU utilization is unexpectedly low, yet users experience highly variable and often slow response times, even for very short, simple questions. Which of the following is the most probable explanation for this specific combination of symptoms?
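The symptom pair in the question (low aggregate GPU utilization together with highly variable latency, even for short prompts) can be illustrated with a minimal simulation. This is a hedged sketch, not part of the question: it assumes static round-robin dispatch, FIFO per-GPU queues, and hypothetical decode-step counts, since a request's output length is unknown at dispatch time.

```python
import random
import statistics

random.seed(0)
NUM_GPUS = 4
NUM_REQUESTS = 200

# Hypothetical generation lengths (decode steps): mostly short answers,
# with an occasional very long one. Output length is unknown up front.
lengths = [5 if random.random() < 0.95 else 5000 for _ in range(NUM_REQUESTS)]

# Static round-robin assignment: request i always goes to GPU i % NUM_GPUS,
# regardless of how backlogged that GPU already is.
queues = [[] for _ in range(NUM_GPUS)]
for i, n in enumerate(lengths):
    queues[i % NUM_GPUS].append(n)

# Each GPU drains its queue FIFO: a short request waits behind every
# long request queued ahead of it on the same GPU (head-of-line blocking).
finish_times = []
busy = []
for q in queues:
    t = 0
    for n in q:
        t += n
        finish_times.append(t)
    busy.append(t)

# Unlucky GPUs with extra long requests set the makespan while the
# others sit idle, dragging down cluster-wide utilization.
makespan = max(busy)
utilization = sum(busy) / (NUM_GPUS * makespan)
p50 = statistics.median(finish_times)
p99 = sorted(finish_times)[int(0.99 * len(finish_times))]
print(f"cluster utilization: {utilization:.2f}")
print(f"latency p50={p50:.0f} steps, p99={p99:.0f} steps")
```

With a skewed length distribution, the GPUs that happen to receive the long generations finish far later than the rest, so average utilization falls below 100% while tail latency balloons, matching the scenario described above.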
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
LLM Inference System Performance Analysis
Ineffectiveness of Static Load Balancing for Generative AI