Impact of Batch Size on the Throughput-Latency Trade-off
The choice of batch size creates a direct trade-off between system throughput and latency. Smaller batches yield lower latency, since each inference pass processes fewer tokens, but they leave parallel hardware such as GPUs partly idle, which reduces overall throughput. Larger batches maximize throughput by keeping the hardware busy with large-scale matrix computations; the cost is higher latency, because a request's result is available only once the entire batch, including the final token, has been processed.
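This trade-off can be sketched with a toy cost model. The constants below (a fixed setup cost per inference pass plus a small per-request cost, reflecting that the batch's matrix computations run in parallel) are illustrative assumptions, not measurements from any real system:

```python
# Toy model of the batching trade-off (illustrative numbers, not measurements).
SETUP_MS = 50.0       # assumed fixed cost of one inference pass
PER_REQUEST_MS = 2.0  # assumed marginal cost of one more request in the batch

def batch_time_ms(batch_size: int) -> float:
    """Time for one inference pass over a batch (simple linear model)."""
    return SETUP_MS + PER_REQUEST_MS * batch_size

def throughput_rps(batch_size: int) -> float:
    """Requests completed per second at this batch size."""
    return batch_size / (batch_time_ms(batch_size) / 1000.0)

def latency_ms(batch_size: int) -> float:
    """Every request in the batch waits for the whole pass to finish."""
    return batch_time_ms(batch_size)

for b in (1, 8, 64):
    print(f"batch={b:3d}  throughput={throughput_rps(b):7.1f} req/s  "
          f"latency={latency_ms(b):6.1f} ms")
```

Under this model, batch size 1 gives roughly 19 req/s at 52 ms latency, while batch size 64 gives roughly 360 req/s at 178 ms latency: throughput and latency rise together, which is exactly the trade-off described above.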
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences