1Cademy - Diagnosing LLM Inference Server Inefficiency

Learn Before

Example of Interleaving Prefilling and Decoding in Continuous Batching

Case Study

Diagnosing LLM Inference Server Inefficiency

Based on the following case study, identify the most likely scheduling inefficiency in the server's logic and explain why it leads to the observed performance degradation.

Updated 2025-10-09

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

An LLM inference server is processing a batch of three requests (A, B, C) and has just completed their initial, compute-intensive processing stage. At this moment, a new request (D) arrives. To maximize hardware utilization and overall system throughput, what is the most efficient action for the server to take in the very next iteration?

An LLM inference server that dynamically manages its workload is processing several requests. The following list describes the key events in this process. Arrange these events in the correct chronological order to reflect the most efficient operational flow.
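
The scenario with requests A, B, C and a late-arriving D can be explored with a small simulation. The sketch below is only a toy model and not any real server's scheduler: the `Request` class, the `simulate` function, the `admit_while_busy` flag, the `max_batch` limit, and the token counts are all hypothetical names and values chosen for illustration. It assumes each request needs one prefill iteration followed by a fixed number of decode iterations, and that a newly admitted request can be prefilled in the same iteration in which the running requests decode.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: int         # iteration at which the request reaches the server
    decode_steps: int    # tokens still to generate after prefill
    prefilled: bool = False


def simulate(requests, admit_while_busy, max_batch=4):
    """Toy iteration-level scheduler; returns {name: finishing iteration}.

    admit_while_busy=True  -> new requests join as soon as a slot is free.
    admit_while_busy=False -> new requests wait until the running batch drains.
    """
    waiting = sorted(requests, key=lambda r: r.arrival)
    running, done, step = [], {}, 0
    while waiting or running:
        # Admission: move arrived requests into the running batch if allowed.
        if admit_while_busy or not running:
            while waiting and waiting[0].arrival <= step and len(running) < max_batch:
                running.append(waiting.pop(0))
        # One iteration: new requests prefill, the others decode one token.
        for r in running:
            if not r.prefilled:
                r.prefilled = True
            else:
                r.decode_steps -= 1
        # Retire requests that have produced all of their tokens.
        for r in running:
            if r.prefilled and r.decode_steps == 0:
                done[r.name] = step
        running = [r for r in running if r.name not in done]
        step += 1
    return done


def workload():
    # A, B, C arrive together; D arrives right after their prefill iteration.
    return [Request("A", 0, 3), Request("B", 0, 6),
            Request("C", 0, 6), Request("D", 1, 2)]


interleaved = simulate(workload(), admit_while_busy=True)
drained = simulate(workload(), admit_while_busy=False)
print("admit immediately:", interleaved)
print("wait for batch to drain:", drained)
```

Under these assumptions, D finishes at iteration 3 when it is admitted immediately and at iteration 9 when it must wait for A, B, and C to finish, while the completion times of A, B, and C are the same under both policies. Real servers also have to account for memory limits and for the cost of a long prefill delaying the decoding requests, which this sketch ignores.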