LLM Inference System Performance Analysis
A company deploys a text-summarization service backed by a large language model on a cluster of four identical devices. A simple round-robin scheduler sends each new request to the next device in a fixed cycle. The service accepts documents of widely varying lengths. System monitoring reveals that during peak usage, average response times are high and device utilization is highly imbalanced: some devices are constantly busy while others are often idle. Based on this scenario, analyze the root cause of the performance issues.
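The imbalance described above can be made concrete with a small simulation: round-robin ignores request cost, so devices that happen to receive long documents accumulate far more work than their neighbors, while a length-aware (least-loaded) policy keeps loads nearly even. This is a minimal sketch, not the company's actual scheduler; the workload distribution and the `simulate` helper are illustrative assumptions.

```python
import random

def simulate(num_devices: int, jobs, policy: str):
    """Assign jobs (processing costs) to devices and return per-device load.

    policy: 'round_robin'  -> fixed cycle, blind to job cost
            'least_loaded' -> each job goes to the device with the
                              least accumulated work (length-aware baseline)
    """
    load = [0.0] * num_devices
    for i, cost in enumerate(jobs):
        if policy == "round_robin":
            d = i % num_devices
        else:  # least_loaded
            d = min(range(num_devices), key=lambda k: load[k])
        load[d] += cost
    return load

# Hypothetical workload: most documents are short, a few are very long,
# mimicking the "widely varying lengths" in the scenario.
random.seed(0)
jobs = [random.choice([1, 1, 1, 100]) for _ in range(200)]

rr = simulate(4, jobs, "round_robin")
ll = simulate(4, jobs, "least_loaded")
print("round-robin loads :", rr)
print("least-loaded loads:", ll)
```

Running this typically shows a large spread between the busiest and idlest device under round-robin, while the greedy least-loaded policy bounds the gap by at most one job's cost, which is exactly the utilization imbalance the monitoring data reports.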
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
LLM Inference System Performance Analysis
A company deploys a large language model for a public-facing Q&A service, distributing inference requests across a cluster of identical GPUs. System monitoring reveals that overall GPU utilization is unexpectedly low, yet users experience highly variable and often slow response times, even for very short, simple questions. Which of the following is the most probable explanation for this specific combination of symptoms?
Ineffectiveness of Static Load Balancing for Generative AI