Learn Before
  • Throughput-Latency Trade-off in Prefilling-Prioritized Continuous Batching

Latency Variability as a Drawback of Continuous Batching

While prioritizing prefilling is effective for maximizing hardware utilization, it introduces a critical trade-off: significant variability in token generation latency. Because a newly arrived prompt's prefill pass preempts the decode steps of all in-flight requests, every active request stalls for the duration of that prefill. This latency inconsistency becomes especially pronounced in systems that handle a mixed workload of long and short input sequences, since a short request can sit behind the lengthy prefill of a long one.
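The effect can be illustrated with a toy scheduler, a minimal sketch rather than any real serving engine: prefills always run to completion before decoding resumes, each decode step emits one token for every active request, and we record the gap between consecutive tokens of each request.

```python
def simulate(requests):
    """Toy prefill-prioritized continuous-batching scheduler.

    requests: list of dicts with keys 'id', 'arrive' (arrival time),
    'prompt' (prefill cost in time units), 'gen' (tokens to generate).
    Returns per-request lists of inter-token gaps (time units).
    """
    t = 0
    waiting = sorted(requests, key=lambda r: r["arrive"])
    active = []  # [id, tokens_left, time_of_last_token]
    gaps = {r["id"]: [] for r in requests}
    while waiting or active:
        if waiting and waiting[0]["arrive"] <= t:
            # Prefill-prioritized: the new prompt is processed immediately,
            # stalling every in-flight decode for the whole prefill.
            r = waiting.pop(0)
            t += r["prompt"]
            active.append([r["id"], r["gen"], t])
        elif active:
            t += 1  # one decode step: one token per active request
            for a in active:
                gaps[a[0]].append(t - a[2])
                a[2] = t
                a[1] -= 1
            active = [a for a in active if a[1] > 0]
        else:
            t = waiting[0]["arrive"]  # GPU idle until next arrival
    return gaps

# Hypothetical workload: a short chat query, then a long document request.
gaps = simulate([
    {"id": "short", "arrive": 0, "prompt": 2, "gen": 10},
    {"id": "long", "arrive": 3, "prompt": 50, "gen": 5},
])
print(gaps["short"])  # [1, 51, 1, 1, 1, 1, 1, 1, 1, 1]
```

The short request streams tokens at a steady one per step until the long prompt arrives; its next token is then delayed by the full 50-unit prefill, producing the single 51-unit gap, exactly the unpredictable stall that users of short queries observe.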

Tags

Ch.5 Inference - Foundations of Large Language Models

Computing Sciences

Related
  • Latency Variability as a Drawback of Continuous Batching

  • Chunked Prefilling

  • Example of Decoder Idle Time in Standard Prefilling

  • An inference server for a large language model uses a continuous batching scheduler designed to maximize hardware utilization by immediately adding new requests to the processing queue. System administrators notice that while the overall token generation rate is high, users submitting short, conversational queries experience significant and unpredictable delays. These delays are most pronounced when the server is simultaneously handling requests to summarize long documents. What is the most likely cause of the high latency for the short queries?

  • LLM Inference Performance Analysis

  • Analyzing Performance Trade-offs in LLM Serving

Learn After
  • An engineering team is analyzing the performance of a new LLM inference server that uses a system to group incoming requests for efficient processing. They observe that the server's hardware is consistently busy, indicating high throughput. However, user feedback is negative, with many complaining that response times are extremely unpredictable; a short question might get an answer instantly one moment, but a similar short question might take many seconds the next. The server handles a mix of long document analysis requests and short conversational queries. What is the most probable explanation for this high variability in response time for short queries?

  • Evaluating an LLM Serving Strategy for Different Use Cases

  • Diagnosing Performance Issues in an LLM Serving System