Comparison

Throughput-Latency Trade-off in LLM Inference

In large-scale, multi-user LLM serving systems, there is a fundamental conflict between maximizing system throughput and minimizing per-request latency. Strategies aimed at increasing throughput, such as batching multiple requests to process more tokens simultaneously, inherently increase the waiting time and thus the latency for individual users, particularly for short, interactive queries. Conversely, optimizing for low latency by serving requests individually or in small batches results in the underutilization of hardware resources and a reduction in overall system throughput. The ideal balance between these competing goals is contingent on the specific quality-of-service (QoS) requirements and user interaction patterns of the application.
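To make the trade-off concrete, the sketch below uses a toy cost model that is not taken from the course material: each batched forward pass is assumed to cost a fixed amount (T_FIXED) plus a small per-request increment (T_PER_REQ), and requests are assumed to arrive at a steady rate (RATE). Under these assumptions, larger batches raise throughput because the fixed cost is amortized over more requests, but each request also waits longer for its batch to fill.

```python
# A minimal sketch (illustrative only): toy model of batched LLM serving.
# Assumptions (not from the source): per-batch step time is
# t(B) = T_FIXED + T_PER_REQ * B, i.e. a fixed kernel/weight-load cost
# plus a small per-request cost, and requests arrive uniformly at RATE req/s.

T_FIXED = 0.050      # seconds of fixed cost per forward pass (assumed)
T_PER_REQ = 0.005    # additional seconds per request in the batch (assumed)
RATE = 40.0          # request arrival rate, requests per second (assumed)

def step_time(batch_size: int) -> float:
    """Time for one batched forward pass under the toy cost model."""
    return T_FIXED + T_PER_REQ * batch_size

def throughput(batch_size: int) -> float:
    """Requests completed per second when always running full batches."""
    return batch_size / step_time(batch_size)

def avg_latency(batch_size: int) -> float:
    """Average per-request latency: wait to fill the batch + processing.

    With uniform arrivals, a request waits (batch_size - 1) / (2 * RATE)
    seconds on average before its batch is full; the whole batch is then
    processed together.
    """
    fill_wait = (batch_size - 1) / (2.0 * RATE)
    return fill_wait + step_time(batch_size)

if __name__ == "__main__":
    print(f"{'batch':>5} {'throughput (req/s)':>20} {'avg latency (s)':>17}")
    for b in (1, 2, 4, 8, 16, 32, 64):
        print(f"{b:>5} {throughput(b):>20.1f} {avg_latency(b):>17.3f}")
```

Running the script prints throughput rising and average latency growing as the batch size increases, which is the trade-off the paragraph above describes; real serving systems tune batch size (or use continuous batching) to meet their QoS targets rather than maximizing either metric alone.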

