Case Study

Release-readiness decision: decoding configuration for a customer-facing summarization feature

You are the on-call ML lead for a B2B product that generates 1–2-sentence “Account Update” summaries from internal CRM notes. A pilot with 200 users shows two failure modes:

  1. Some summaries are overly short (often 6–10 words) and omit key facts.
  2. Other summaries are fluent but occasionally include a plausible-sounding detail that is not in the notes.

Constraints: p95 latency must stay under 250 ms; the product team wants consistent tone across runs for the same input, but they also want to avoid repetitive phrasing across different accounts. You can change only decoding-time settings (no retraining, no prompt changes).

You must choose ONE decoding approach to ship this week and specify concrete settings for: (a) the core decoding algorithm (greedy, beam search, top-k sampling, or top-p sampling), (b) whether/how you will use temperature scaling, and (c) whether/how you will apply a length penalty. In your answer, justify how your choices jointly address both failure modes and the stated constraints, including at least one tradeoff you are accepting.
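The sketch below is not the answer to the exercise; it is a minimal, self-contained illustration of the three decoding-time knobs the question names (temperature scaling, top-p/nucleus truncation, and a length penalty) so the moving parts are concrete before you argue for a configuration. All concrete values in it (temperature=0.7, top_p=0.9, alpha=0.6, the random seed, and the toy logits) are illustrative assumptions, not recommended settings.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature; T < 1 sharpens the distribution, T > 1 flattens it."""
    return [l / temperature for l in logits]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_p_sample(logits, top_p=0.9, temperature=0.7, rng=None):
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose cumulative
    probability reaches top_p, renormalize over that set, and sample from it."""
    rng = rng or random.Random()
    probs = softmax(apply_temperature(logits, temperature))
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    r, acc = rng.random() * total, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

def length_penalized_score(sum_log_prob, length, alpha=0.6):
    """GNMT-style length penalty: divide the sequence log-probability by
    ((5 + length) / 6) ** alpha so longer candidates are not unfairly penalized."""
    return sum_log_prob / (((5.0 + length) / 6.0) ** alpha)

# Usage: fixing the seed per input is one way to get reproducible output
# ("consistent tone across runs for the same input") while still sampling.
rng = random.Random(1234)
logits = [2.0, 1.5, 0.3, -1.0, -2.5]  # toy next-token logits for a 5-token vocabulary
token_id = top_p_sample(logits, top_p=0.9, temperature=0.7, rng=rng)
print("sampled token id:", token_id)

# Length penalty compares a short, high-likelihood candidate against a longer one.
print("8-token candidate score:", length_penalized_score(-4.0, length=8))
print("16-token candidate score:", length_penalized_score(-7.5, length=16))
```

Note how each knob maps onto the scenario: temperature and top-p trade off hallucination risk against repetitive phrasing, the length penalty pushes back against overly short summaries, and seeding controls run-to-run consistency. Your answer should commit to one combination and defend it against the latency and consistency constraints.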
