An inference server is managing a batch of several short requests that are mid-generation, each producing output tokens. A new request with a very long input sequence arrives, and the scheduler immediately incorporates it into the active batch to begin its prefill, aiming to keep the hardware as busy as possible. What is the most probable consequence for the short requests already in the batch?
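A minimal sketch of why this matters, assuming uniform per-token compute cost and hypothetical function and parameter names (not from any particular serving framework): if the scheduler folds the long request's entire prefill into the same iteration as the ongoing decode steps, the short requests' next tokens cannot be emitted until that prefill finishes, so their inter-token latency spikes even though overall hardware utilization stays high.

```python
# Toy, illustrative simulation (hypothetical costs and names): a
# continuous-batching scheduler that admits a new request immediately,
# so its full prefill shares an iteration with the ongoing decode steps.

PREFILL_COST_PER_TOKEN = 1.0  # assumed cost to process one prompt token
DECODE_COST_PER_TOKEN = 1.0   # assumed cost to generate one output token


def inter_token_delays(long_prompt_len, decode_steps, num_short_requests):
    """Per-iteration latency experienced by the short, decoding requests."""
    delays = []
    for step in range(decode_steps):
        # Each iteration, every short request contributes one decode token.
        cost = num_short_requests * DECODE_COST_PER_TOKEN
        if step == 0:
            # The long request is admitted at once, so its entire prefill
            # is computed in this iteration and dominates its runtime.
            cost += long_prompt_len * PREFILL_COST_PER_TOKEN
        delays.append(cost)
    return delays


if __name__ == "__main__":
    delays = inter_token_delays(long_prompt_len=8000,
                                decode_steps=5,
                                num_short_requests=4)
    print(delays)  # [8004.0, 4.0, 4.0, 4.0, 4.0]
```

In this sketch the short requests' first inter-token interval balloons from about 4 units to about 8004 units: a visible generation stall caused by prioritizing the new request's prefill.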
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Throughput-Latency Trade-off in Prefilling-Prioritized Continuous Batching
LLM Inference Server Performance Analysis
Evaluating Scheduling Strategies for Real-Time Applications