Multiple Choice

An inference serving system for a large language model must handle requests from two user tiers: 'Premium' users who pay for guaranteed low latency, and 'Standard' users. The system also runs internal, non-urgent 'Analytics' jobs that can tolerate high latency. The primary business goal is to retain Premium users by meeting their low-latency expectations, while still processing requests from other tiers. Which custom scheduling policy would be the most effective for achieving this business goal?
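The scenario above describes tiered request scheduling. As a point of reference for reasoning about the answer choices, here is a minimal sketch of one candidate policy: strict priority across tiers with FIFO order within a tier. The tier names match the question; the `TieredScheduler` class, its methods, and the numeric priority values are hypothetical illustrations, not part of the question.

```python
import heapq
import itertools

# Hypothetical priority values: lower number = scheduled first.
TIER_PRIORITY = {"Premium": 0, "Standard": 1, "Analytics": 2}

class TieredScheduler:
    """Strict-priority scheduler: Premium preempts Standard, which
    preempts Analytics; arrival order is preserved within a tier."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO within a tier

    def submit(self, tier, request):
        heapq.heappush(self._heap,
                       (TIER_PRIORITY[tier], next(self._counter), request))

    def next_request(self):
        if not self._heap:
            return None
        _, _, request = heapq.heappop(self._heap)
        return request

sched = TieredScheduler()
sched.submit("Standard", "s1")
sched.submit("Analytics", "a1")
sched.submit("Premium", "p1")
print(sched.next_request())  # the Premium request is dispatched first
```

Note that a pure strict-priority policy can starve lower tiers under sustained Premium load; policies that "still process requests from other tiers" typically add aging or a weighted share for Standard and Analytics.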

Updated 2025-10-03

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science