An engineer is monitoring a text generation inference server that groups incoming requests into batches. They observe that while the time-to-completion for any single request within a running batch is very fast, the server's overall throughput (requests processed per hour) is low, with significant periods of hardware idleness. What is the most likely cause of this performance profile?
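The profile described (fast per-request completion once a batch runs, but low overall throughput and idle hardware) is characteristic of static batching: the server waits until a fixed-size batch has filled before launching any computation, so under sparse traffic the accelerator idles between batches. The sketch below simulates this with invented numbers (`BATCH_SIZE`, arrival interval, and compute time are all hypothetical, chosen only to make the effect visible), not a real serving stack.

```python
# Hypothetical simulation of static batching: the server waits until a full
# batch of BATCH_SIZE requests has accumulated before it starts processing,
# so the hardware idles whenever arrivals are sparse. All constants are
# illustrative, not measurements from any real system.

BATCH_SIZE = 8
BATCH_COMPUTE_TIME = 1.0   # seconds to process one full batch (fast per request)
ARRIVAL_INTERVAL = 2.0     # one new request every 2 s (sparse traffic)

def simulate(num_requests):
    # deterministic arrival times: request i arrives at i * ARRIVAL_INTERVAL
    arrivals = [i * ARRIVAL_INTERVAL for i in range(num_requests)]
    clock = 0.0   # wall-clock time
    busy = 0.0    # time the hardware spent computing
    done = 0
    for start in range(0, num_requests, BATCH_SIZE):
        batch = arrivals[start:start + BATCH_SIZE]
        # the batch can only launch once its last request has arrived,
        # so the hardware sits idle until then
        clock = max(clock, batch[-1])
        clock += BATCH_COMPUTE_TIME
        busy += BATCH_COMPUTE_TIME
        done += len(batch)
    return done / clock, busy / clock  # throughput (req/s), utilization

throughput, utilization = simulate(64)
```

With these numbers, each request costs only 0.125 s of compute once its batch runs, yet overall throughput is about 0.5 requests/s and the hardware is busy less than 10% of the time, because it waits for each batch to fill. This is the trade-off the question targets; techniques such as continuous (in-flight) batching exist precisely to avoid this idle waiting.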
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of Batch Processing Trade-offs
Evaluating an LLM Inference Strategy for a Real-Time Chatbot