Learn Before
Inference Scheduling Trade-offs
An LLM inference system is currently generating responses for several interactive user chats. A new, large batch of requests for offline document analysis arrives. The system scheduler must decide whether to immediately start processing the initial prompts for the new batch or to wait until the current chat responses are fully generated. Explain the likely impact on overall system throughput and the response time for the chat users if the scheduler chooses to immediately process the new batch.
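The trade-off in the question can be made concrete with a toy two-phase model. The sketch below is my own illustrative construction (the phase structure, time units, and all workload numbers are assumptions, not from the course): each decode step emits one token per active sequence, so admitting the offline batch early stalls the chats during prefill but then decodes a larger batch, raising total token throughput.

```python
# Illustrative workload parameters (assumed, not from the course material).
PREFILL_TIME = 40      # time units to prefill the whole offline batch
CHAT_SEQS = 4          # interactive chats currently mid-generation
CHAT_TOKENS_LEFT = 30  # decode steps each chat still needs
OFFLINE_SEQS = 16      # sequences in the new offline analysis batch
OFFLINE_TOKENS = 50    # decode steps each offline sequence needs

def simulate(prefill_first: bool):
    """Return (chat_completion_time, tokens_per_time_unit) under one policy."""
    if prefill_first:
        # Phase 1: prefill the offline batch; chat decoding stalls, so chat
        # users see a long pause mid-response (inter-token latency spike).
        # Phase 2: decode chats and offline sequences together in one batch.
        chat_done = PREFILL_TIME + CHAT_TOKENS_LEFT
        total_time = PREFILL_TIME + max(CHAT_TOKENS_LEFT, OFFLINE_TOKENS)
    else:
        # Phase 1: finish the chat responses alone (small batch, GPU
        # under-utilized). Phase 2: prefill. Phase 3: decode offline batch.
        chat_done = CHAT_TOKENS_LEFT
        total_time = CHAT_TOKENS_LEFT + PREFILL_TIME + OFFLINE_TOKENS
    total_tokens = CHAT_SEQS * CHAT_TOKENS_LEFT + OFFLINE_SEQS * OFFLINE_TOKENS
    return chat_done, total_tokens / total_time

chat_eager, tput_eager = simulate(prefill_first=True)
chat_wait, tput_wait = simulate(prefill_first=False)
```

Under these assumed numbers, processing the batch immediately finishes the chats later (70 vs. 30 time units) but sustains a higher token rate (≈10.2 vs. ≈7.7 tokens per time unit), which is the expected answer: throughput improves while chat response time degrades.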
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Prefilling-Prioritized Strategy in Continuous Batching
Decoding-Prioritized Strategy in Standard Batching
Custom Priority Policies in LLM Scheduling
Inference Scheduling Trade-offs
An AI company operates a service that uses a large language model to summarize vast archives of legal documents. The primary business goal is to maximize the total number of documents summarized each day. The system receives a constant stream of new summarization requests. Given this primary goal, which scheduling approach for managing inference tasks would be most effective?
Optimizing a Hybrid LLM Service
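For the related question on maximizing documents summarized per day, the throughput-oriented policy it points toward (a prefill-prioritized continuous-batching loop) can be sketched as follows. This is a minimal toy model, not a real serving API; `MAX_BATCH`, the request dicts, and the uniform per-request token count are all illustrative assumptions.

```python
from collections import deque

MAX_BATCH = 8  # assumed decode-batch capacity

def run(prompts, tokens_per_request):
    """Prefill-prioritized continuous batching: whenever a decode slot frees
    up, a waiting prompt is admitted immediately so the batch stays full and
    aggregate token throughput stays high."""
    waiting = deque(prompts)
    active = []      # requests currently in the decode batch
    completed = []
    steps = 0
    while waiting or active:
        # Admission: top the batch up before the next decode step.
        while waiting and len(active) < MAX_BATCH:
            active.append({"id": waiting.popleft(), "left": tokens_per_request})
        steps += 1   # one decode step over the whole batch
        for req in active:
            req["left"] -= 1
        completed += [r["id"] for r in active if r["left"] == 0]
        active = [r for r in active if r["left"] > 0]
    return completed, steps
```

Because the batch is refilled the moment capacity frees up, the GPU never idles while work is waiting, which is why this style of scheduling suits the throughput-maximizing goal in the question (at the cost of per-request latency, which that service does not prioritize).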