Learn Before
  • Throughput-Latency Trade-off in LLM Inference

Impact of Batch Size on the Throughput-Latency Trade-off

The choice of batch size creates a direct trade-off between system throughput and latency. Smaller batches yield lower latency because each inference pass processes fewer tokens, so results return sooner; however, they leave parallel hardware underutilized, with GPU resources sitting idle, which reduces overall system throughput. Larger batches maximize throughput by fully engaging the hardware's parallelism, keeping GPUs occupied with large-scale matrix computations. That efficiency comes at the cost of higher latency: no request's result is available until the entire batch, including its final token, has been processed.
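The trade-off can be made concrete with a toy cost model. The sketch below assumes each inference pass has a fixed overhead (kernel launches, loading weights) plus a smaller per-sequence cost; the specific numbers are illustrative assumptions, not measurements from any real system. Under this model, larger batches amortize the fixed overhead (higher throughput) while every request waits for the whole batch (higher latency).

```python
# Toy latency/throughput model for batched LLM inference.
# The cost constants are assumed for illustration only.

FIXED_OVERHEAD_S = 0.05   # assumed per-pass cost, paid once per batch
PER_SEQ_COST_S = 0.01     # assumed incremental cost per sequence in the batch

def batch_time(batch_size: int) -> float:
    """Time for one inference pass over a batch (simplified linear model)."""
    return FIXED_OVERHEAD_S + PER_SEQ_COST_S * batch_size

def latency(batch_size: int) -> float:
    """A request's result is ready only when the entire batch finishes."""
    return batch_time(batch_size)

def throughput(batch_size: int) -> float:
    """Sequences completed per second of compute."""
    return batch_size / batch_time(batch_size)

for bs in (1, 8, 64):
    print(f"batch={bs:3d}  latency={latency(bs):.2f}s  "
          f"throughput={throughput(bs):.1f} seq/s")
```

Running this shows both quantities rising with batch size: batching improves throughput because the fixed overhead is shared across more sequences, but latency grows because the pass itself takes longer and every request waits for it.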

Tags
  • Ch.5 Inference - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • Impact of Batch Size on the Throughput-Latency Trade-off

  • An engineering team is optimizing a system that serves a large language model to multiple users. To maximize the number of requests processed per hour, they decide to group incoming requests into large batches before sending them to the hardware for processing. This approach significantly increases the system's overall processing capacity. For which of the following applications would this optimization strategy be most detrimental to the user experience?

  • Optimizing LLM Serving for Different Applications

  • The Core Trade-off in LLM Serving

Learn After
  • Optimizing LLM Serving Configuration

  • An engineering team is deploying a large language model to power a real-time, interactive customer service chatbot. The top priority is ensuring that users experience minimal delay between sending a message and receiving a response. Which batch size strategy should the team implement to best achieve this goal?

  • Example of Throughput Gain with Increased Batch Size

  • Example of Minimal Latency with a Single Sequence

  • Match each performance characteristic of a language model serving system with the batch size strategy that is its primary cause.