Learn Before
An inference server processes user requests in groups. The server's scheduling policy dictates that it must wait for every single request within a group to finish generating its full response before it can begin processing the next group of requests. If a group contains three requests that take 4 seconds, 7 seconds, and 12 seconds to complete, respectively, when will the server become available to start processing a new group?
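Since the scheduling policy above waits for the slowest request in the group, the server frees up at the maximum completion time. A minimal sketch of this arithmetic (the function name `batch_available_at` is illustrative, not from any library):

```python
def batch_available_at(completion_times):
    """Return the time at which a static batch finishes.

    Under decoding-prioritized (static) batching, the server cannot
    start the next group until every request in the current group is
    done, so the batch completes at the slowest request's time.
    """
    return max(completion_times)

times = [4, 7, 12]  # seconds, from the question
print(batch_available_at(times))  # prints 12
```

The fast requests (4 s and 7 s) sit idle while the 12-second request finishes, which is the inefficiency that continuous batching addresses.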
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Decoding-Prioritized Strategy in Standard Batching
Comparison of Continuous (Prefilling-Prioritized) vs. Standard (Decoding-Prioritized) Batching
Diagnosing Inference Server Performance Issues
Analyzing Static Batching Inefficiency