Learn Before
An inference serving system for a large language model must handle requests from two user tiers: 'Premium' users who pay for guaranteed low latency, and 'Standard' users. The system also runs internal, non-urgent 'Analytics' jobs that can tolerate high latency. The primary business goal is to retain Premium users by meeting their low-latency expectations, while still processing requests from other tiers. Which custom scheduling policy would be the most effective for achieving this business goal?
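To make the scenario concrete, here is a minimal sketch of one way a custom scheduling policy could encode tier priorities: a strict-priority queue built on Python's `heapq`, with FIFO ordering inside each tier. The tier names and numeric priority values are illustrative assumptions for this card's scenario, not a prescribed answer.

```python
import heapq
import itertools

# Hypothetical tier-to-priority mapping for the scenario above:
# lower number = served first.
TIER_PRIORITY = {"Premium": 0, "Standard": 1, "Analytics": 2}

class TieredScheduler:
    """Strict-priority queue: Premium before Standard before Analytics,
    FIFO within each tier (the counter breaks ties by arrival order)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, tier, request_id):
        # Push (priority, arrival_order, request) so the heap pops the
        # highest-priority, earliest-arriving request first.
        heapq.heappush(
            self._heap,
            (TIER_PRIORITY[tier], next(self._counter), request_id),
        )

    def next_request(self):
        if not self._heap:
            return None
        _, _, request_id = heapq.heappop(self._heap)
        return request_id
```

Note the trade-off this sketch exposes: strict priority serves Premium latency directly, but under sustained Premium load it can starve Analytics jobs, which is why refinements such as aging or weighted fair sharing are often considered alongside it.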
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating Scheduling Policies for a Multi-Tenant LLM Service
Analyzing Trade-offs in Deadline-Aware LLM Scheduling