Case Study

Decoding policy decision for a multilingual support assistant under safety, latency, and verbosity constraints

You own the decoding configuration for an internal multilingual customer-support assistant that drafts replies agents can send with minimal edits. The business constraints are: (1) replies must be deterministic for the same ticket to support auditability, (2) average generation latency must stay under 250 ms, and (3) replies must not be overly short (agents complain about missing key steps) but also must not ramble. In addition, the current model sometimes produces a safe but generic first sentence and then either ends too early or repeats itself.

You are given two candidate configurations to choose from for the next release:

Config A:

  • Greedy decoding
  • No length penalty
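
For concreteness, a minimal sketch of Config A's greedy rule in Python. The step(prefix) callable, the token ids, and max_len are illustrative assumptions, not part of the case study:

    import numpy as np

    def greedy_decode(step, bos_id, eos_id, max_len=128):
        # Always pick the single highest-probability token, so the same
        # ticket yields the same reply (auditable), one model call per step.
        tokens = [bos_id]
        for _ in range(max_len):
            logits = step(tokens)   # assumed: next-token logits, shape (vocab,)
            next_id = int(np.argmax(logits))
            tokens.append(next_id)
            # With no length penalty, nothing offsets an early
            # high-probability EOS, so a reply can end before it
            # covers all the key steps.
            if next_id == eos_id:
                break
        return tokens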

Config B:

  • Beam search with beam width B=4
  • Apply a length penalty that discourages very short completions
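
A sketch of Config B under the same assumed step(prefix) interface. The GNMT-style penalty form and alpha=0.6 are assumptions; the case study only states that the penalty discourages very short completions:

    import numpy as np

    def log_softmax(logits):
        z = logits - np.max(logits)
        return z - np.log(np.sum(np.exp(z)))

    def length_penalty(n_tokens, alpha=0.6):
        # Assumed GNMT-style penalty: dividing a summed log-prob by this
        # term boosts longer hypotheses, countering premature stopping.
        return ((5.0 + n_tokens) / 6.0) ** alpha

    def beam_search(step, bos_id, eos_id, beam_width=4, max_len=128):
        # No sampling anywhere, so the search is deterministic; the cost
        # is roughly beam_width model calls per step (B=4 vs. 1 for greedy).
        beams = [([bos_id], 0.0)]      # (tokens, summed log-prob)
        finished = []
        for _ in range(max_len):
            candidates = []
            for tokens, score in beams:
                lp = log_softmax(step(tokens))
                for tok in np.argsort(lp)[-beam_width:]:
                    candidates.append((tokens + [int(tok)],
                                       score + float(lp[tok])))
            # Rank by penalized score so hypotheses are not preferred
            # merely for having fewer (negative) log-prob terms.
            candidates.sort(key=lambda c: c[1] / length_penalty(len(c[0])),
                            reverse=True)
            beams = []
            for tokens, score in candidates:
                if tokens[-1] == eos_id:
                    finished.append((tokens,
                                     score / length_penalty(len(tokens))))
                else:
                    beams.append((tokens, score))
                if len(beams) == beam_width:
                    break
            if not beams:
                break
        finished += [(t, s / length_penalty(len(t))) for t, s in beams]
        return max(finished, key=lambda c: c[1])[0]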

A third option is proposed by another team:

Config C:

  • Top-p (nucleus) sampling with p=0.9
  • Temperature-scaled softmax with β=1.3
  • No length penalty
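
A sketch of Config C, again with the assumed step(prefix) interface. It treats β as a temperature divisor (softmax over logits/β), so β=1.3 flattens the distribution before truncation; if the course defines β as an inverse temperature, the effect reverses:

    import numpy as np

    def top_p_sample(step, bos_id, eos_id, p=0.9, beta=1.3,
                     max_len=128, seed=None):
        # Nucleus sampling: keep the smallest set of tokens whose
        # cumulative probability reaches p, renormalize, and sample
        # within that set.
        rng = np.random.default_rng(seed)  # without a fixed seed, the same
        tokens = [bos_id]                  # ticket can yield different replies
        for _ in range(max_len):
            logits = step(tokens) / beta   # assumed convention: beta divides
            probs = np.exp(logits - np.max(logits))
            probs /= probs.sum()
            order = np.argsort(probs)[::-1]   # tokens by descending prob
            cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
            keep = order[:cutoff]             # the nucleus
            next_id = int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
            tokens.append(next_id)
            if next_id == eos_id:
                break
        return tokens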

Case study task: Choose the single best configuration (A, B, or C) for this release and justify your choice by explicitly explaining how your selected approach handles (i) determinism vs. diversity, (ii) the risk of premature stopping vs. overly long outputs (including the role of the length penalty), and (iii) latency/compute trade-offs. Your justification must reference at least two concrete failure modes described above and explain why the other two configurations are less suitable under the stated constraints.
