Short Answer

Inference Scheduling Trade-offs

An LLM inference system is currently generating responses for several interactive user chats. A new, large batch of requests for offline document analysis arrives. The system scheduler must decide whether to immediately start processing the initial prompts for the new batch or to wait until the current chat responses are fully generated. Explain the likely impact on overall system throughput and the response time for the chat users if the scheduler chooses to immediately process the new batch.
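A concrete way to reason about this is a toy, step-based simulation of a continuous-batching scheduler. The sketch below is purely illustrative: the two policies (`prefill_first`, the immediate-admission choice, and `decode_first`, the wait choice), the step capacity, and the workload sizes are assumptions invented for this example, not the behavior of any real serving engine.

```python
# Toy simulation of the scheduling trade-off. All constants are illustrative
# assumptions, not measurements from a real system.

CHATS = 4              # active interactive chats; each gets 1 token per decode step
CHAT_TOKENS_LEFT = 20  # tokens still owed to each chat
PROMPTS = 16           # offline documents waiting to be prefilled
PROMPT_TOKENS = 2048   # prompt length of each offline document
STEP_CAPACITY = 512    # prompt tokens one compute-bound prefill step can process


def simulate(policy: str):
    """Return (worst gap between chat tokens, step chats finish,
    step offline prefill finishes), all measured in scheduler steps."""
    prefill_left = PROMPTS * PROMPT_TOKENS
    chat_left = CHAT_TOKENS_LEFT
    step = last_chat_token = worst_gap = 0
    chat_done = prefill_done = None
    while chat_left > 0 or prefill_left > 0:
        step += 1
        if prefill_left > 0 and (policy == "prefill_first" or chat_left == 0):
            # Prefill step: the batch is packed with prompt tokens, so GPU
            # utilization is high, but every active chat stalls this step.
            prefill_left -= min(STEP_CAPACITY, prefill_left)
            if prefill_left == 0:
                prefill_done = step
        else:
            # Decode step: only CHATS tokens of work (memory-bound, the GPU
            # is mostly idle), but each chat user receives a token.
            chat_left -= 1
            worst_gap = max(worst_gap, step - last_chat_token)
            last_chat_token = step
            if chat_left == 0:
                chat_done = step
    return worst_gap, chat_done, prefill_done


for policy in ("prefill_first", "decode_first"):
    gap, chat_done, prefill_done = simulate(policy)
    print(f"{policy:13s}  worst chat token gap: {gap:3d} steps   "
          f"chats done: step {chat_done}   offline prefill done: step {prefill_done}")
```

Running this, `prefill_first` finishes the offline prefill at step 64 and keeps the accelerator saturated from the start, but chat users wait roughly 65 steps for their next token; `decode_first` keeps a steady one-step token cadence for the chats (done by step 20) while the offline batch waits until step 84. Production schedulers typically blend the two, for example with chunked prefill or priority-aware admission, to recover throughput without letting interactive latency spike.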

Tags: Ch.5 Inference - Foundations of Large Language Models; Analysis in Bloom's Taxonomy