Comparison

Impact of Batch Size on the Throughput-Latency Trade-off

The choice of batch size creates a direct trade-off between system throughput and latency. Smaller batches yield lower latency because fewer tokens are processed in each inference pass, but they leave parallel computing resources underutilized (GPUs sit partially idle), which reduces overall throughput. Larger batches maximize throughput by fully engaging the hardware's parallelism, keeping the GPUs busy with large-scale matrix computations. This efficiency comes at the cost of higher latency: each request's result becomes available only once the entire batch, down to its final token, has been processed.
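The trade-off can be sketched with a toy cost model. The sketch below assumes, hypothetically, that one inference pass costs a fixed overhead (weight movement, kernel launches) plus a small marginal cost per sequence in the batch; the function names and the specific millisecond values are illustrative, not from the text.

```python
def pass_time_ms(batch_size: int, overhead_ms: float = 10.0,
                 per_seq_ms: float = 1.0) -> float:
    """Time for one inference pass under an assumed linear cost model:
    a fixed overhead plus a small marginal cost per sequence, reflecting
    that GPU parallelism makes extra sequences cheap to add."""
    return overhead_ms + per_seq_ms * batch_size

def latency_ms(batch_size: int) -> float:
    # Every request's result is ready only when the whole batch finishes.
    return pass_time_ms(batch_size)

def throughput_seq_per_s(batch_size: int) -> float:
    # Sequences completed per second: batch size over the pass time.
    return batch_size / pass_time_ms(batch_size) * 1000.0

for b in (1, 8, 64):
    print(f"batch={b:3d}  latency={latency_ms(b):6.1f} ms  "
          f"throughput={throughput_seq_per_s(b):7.1f} seq/s")
```

Under these assumed numbers, growing the batch from 1 to 64 raises throughput severalfold (the fixed overhead is amortized over more sequences) while latency grows with every sequence added, which is the trade-off described above.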


Updated 2026-05-05


Ch.5 Inference - Foundations of Large Language Models
