Resource Reallocation in Dynamic Batching
An inference engine is processing a batch containing four active sequences (A, B, C, and D). After one processing iteration, sequence B generates its end-of-sequence token. Describe the two primary changes to the system's state that the scheduler will implement before the next iteration begins, and explain the direct benefit of these changes for overall system efficiency.
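The scheduling behavior the question asks about can be sketched in code. The class and method names below are illustrative, not from any particular engine (real systems such as vLLM manage paged KV-cache blocks rather than whole-sequence buffers); the sketch only shows the two state changes at issue: evicting the finished sequence and reclaiming its KV-cache memory, then admitting a waiting request into the freed batch slot.

```python
from collections import deque


class ContinuousBatchScheduler:
    """Minimal sketch of continuous (dynamic) batching.

    All names are hypothetical stand-ins for an engine's internal
    bookkeeping; KV caches are represented by opaque objects.
    """

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.active = {}        # seq_id -> KV-cache handle (stand-in)
        self.waiting = deque()  # queued requests not yet in the batch

    def submit(self, seq_id):
        self.waiting.append(seq_id)

    def after_iteration(self, finished_ids):
        """Called once per iteration with sequences that emitted EOS."""
        # Change 1: evict finished sequences from the batch and
        # release their KV-cache memory back to the pool.
        for seq_id in finished_ids:
            self.active.pop(seq_id, None)
        # Change 2: immediately admit waiting requests into the freed
        # slots, so the next iteration runs at full batch capacity
        # instead of leaving compute and memory idle.
        while self.waiting and len(self.active) < self.max_batch_size:
            new_id = self.waiting.popleft()
            self.active[new_id] = object()  # allocate a fresh KV cache
```

Run against the scenario in the question: with A, B, C, D active and a request E queued, finishing B frees one slot and E is scheduled into it before the next iteration.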
Tags
Ch.5 Inference - Foundations of Large Language Models
Related
An inference engine is processing a group of three text generation requests simultaneously. After a few computational steps, two of the requests have finished generating their complete output, while the third, much longer request, is still in progress. To optimize overall system throughput, what is the most logical immediate next action for the engine's scheduler to take regarding this group of requests?
Analyzing Inference Engine Performance Logs