Activity (Process)

Queueing Requests in Continuous Batching

In continuous batching, when new user requests arrive while the inference engine is operating at full capacity, the scheduler does not add them to the active batch immediately. Instead, it places them in a queue, where they wait until resources are freed, for instance when an existing sequence in the batch finishes generating.
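The queueing behavior described above can be illustrated with a minimal sketch. It is not taken from any particular inference engine: the `Request` and `Scheduler` names, the `max_batch_size` parameter, and the `tokens_left` counter are hypothetical, and capacity is modeled simply as a fixed number of batch slots, whereas real engines may also bound admission by memory.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    tokens_left: int  # decode steps remaining (illustrative stand-in for generation length)


class Scheduler:
    """Toy scheduler; capacity is assumed to be a fixed number of batch slots."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.active = []        # sequences currently in the batch
        self.waiting = deque()  # requests that arrived while the batch was full

    def submit(self, req: Request) -> None:
        # Admit immediately only if a slot is free; otherwise enqueue.
        if len(self.active) < self.max_batch_size:
            self.active.append(req)
        else:
            self.waiting.append(req)

    def step(self) -> list:
        # One decode iteration: every active sequence produces one token.
        for req in self.active:
            req.tokens_left -= 1
        finished = [r.req_id for r in self.active if r.tokens_left <= 0]
        self.active = [r for r in self.active if r.tokens_left > 0]
        # Freed slots are refilled from the queue in arrival (FIFO) order.
        while self.waiting and len(self.active) < self.max_batch_size:
            self.active.append(self.waiting.popleft())
        return finished


sched = Scheduler(max_batch_size=2)
sched.submit(Request("a", tokens_left=1))
sched.submit(Request("b", tokens_left=3))
sched.submit(Request("c", tokens_left=2))  # batch is full, so "c" is queued
done = sched.step()                        # "a" completes, freeing a slot for "c"
```

In this sketch the third request joins the batch at the very next iteration after a sequence completes, rather than waiting for the whole batch to drain.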

Updated 2025-10-10

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences