Case Study

Choosing a Decoding Configuration Under Latency, Diversity, and Length Constraints

You are deploying an internal LLM feature that drafts customer-facing email replies for account managers. Requirements: (1) average latency must stay under 250 ms, (2) outputs must not be overly “templated” across similar tickets (product wants noticeable variation), and (3) replies must be 90–130 words; current outputs are often ~50 words and sometimes end abruptly. You can change only decoding-time settings (no fine-tuning).

Propose ONE concrete decoding configuration (algorithm + key parameters) that best meets all three requirements, and justify it by explaining how your choices jointly affect (a) determinism vs diversity, (b) search/compute cost, and (c) output length. Your answer must explicitly address why you did NOT choose at least one plausible alternative (e.g., greedy, beam search, top-k, or top-p) given the constraints.
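The determinism/diversity and length levers the prompt asks about can be sketched with a toy nucleus (top-p) sampler. This is an illustrative assumption, not the course's reference solution: the helper names (`top_p_filter`, `sample_reply`), the tiny hand-made token distribution, and the short length window stand in for a real model's next-token probabilities and the 90–130 word band.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose
    cumulative mass reaches p (the nucleus), then renormalize.
    probs: dict mapping token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for tok, pr in ranked:
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    z = sum(kept.values())
    return {t: pr / z for t, pr in kept.items()}

def sample_reply(next_token_probs, p=0.9, min_len=3, max_len=6,
                 eos="<eos>", seed=0):
    """Toy decoding loop: nucleus sampling plus a length window.
    EOS is masked out until min_len tokens are emitted, and decoding
    stops at max_len -- the same levers used to keep replies from
    ending abruptly or running long."""
    rng = random.Random(seed)  # fixed seed => reproducible sample
    out = []
    while len(out) < max_len:
        probs = dict(next_token_probs)
        if len(out) < min_len:
            probs.pop(eos, None)  # forbid stopping too early
            z = sum(probs.values())
            probs = {t: pr / z for t, pr in probs.items()}
        probs = top_p_filter(probs, p)
        toks = list(probs)
        tok = rng.choices(toks, weights=[probs[t] for t in toks])[0]
        if tok == eos:
            break
        out.append(tok)
    return out
```

Note the trade-offs the sketch exposes: `p` close to 1 widens the nucleus (more diversity, like pure sampling), while a small `p` collapses it toward greedy decoding; the per-step cost is one sort over the vocabulary, with no beam-style branching; and `min_len`/`max_len` bound output length independently of the sampling choice.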

Updated 2026-02-06

Tags

Ch.5 Inference - Foundations of Large Language Models