Learn Before
An operations team monitors an LLM inference system and notices that the hardware responsible for model execution is consistently underutilized, even when there is a continuous stream of user requests waiting to be processed. This leads to lower-than-expected overall system throughput. In a standard workflow where requests are grouped into batches by a scheduler before being processed, what is the most probable explanation for this specific performance issue?
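A minimal Python sketch of the scenario described above, assuming a static-batching policy in which the scheduler dispatches only full batches. The batch size, arrival rate, and per-step compute time below are illustrative assumptions, not measurements from a real system. While the scheduler blocks to fill a batch, the accelerator sits idle even though requests keep queuing, which reproduces the underutilization the question describes.

```python
import queue
import threading
import time

# Assumed, illustrative parameters -- not taken from any real deployment.
BATCH_SIZE = 8           # scheduler dispatches only full batches (assumed policy)
ARRIVAL_INTERVAL = 0.05  # seconds between request arrivals
STEP_TIME = 0.02         # accelerator time to execute one batch

request_queue: queue.Queue[int] = queue.Queue()
busy_time = 0.0  # total time the "accelerator" spends computing


def producer(n_requests: int) -> None:
    """Simulate a continuous stream of incoming user requests."""
    for i in range(n_requests):
        request_queue.put(i)
        time.sleep(ARRIVAL_INTERVAL)


def static_batching_scheduler(n_requests: int) -> None:
    """Dispatch to the accelerator only once BATCH_SIZE requests have accumulated."""
    global busy_time
    served = 0
    while served < n_requests:
        # Blocks here until a full batch is available -- the accelerator idles.
        batch = [request_queue.get() for _ in range(BATCH_SIZE)]
        start = time.perf_counter()
        time.sleep(STEP_TIME)  # stand-in for one model execution step
        busy_time += time.perf_counter() - start
        served += len(batch)


if __name__ == "__main__":
    N = 64
    t0 = time.perf_counter()
    threading.Thread(target=producer, args=(N,), daemon=True).start()
    static_batching_scheduler(N)
    wall = time.perf_counter() - t0
    # Utilization stays low because most wall-clock time is spent
    # waiting for batches to fill, not computing.
    print(f"accelerator utilization: {busy_time / wall:.0%}")
```

Running this sketch typically reports single-digit utilization despite a steady request stream; reducing the batch-fill wait (for example, via a dispatch timeout or a continuous-batching scheduler) would raise it.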
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Arrange the following stages of a typical request processing workflow in a Large Language Model (LLM) inference system into the correct chronological order, from the initial arrival of a request to the final output.
Diagram of the LLM Inference Workflow
LLM Inference Scheduling Strategy