Essay

Post-incident analysis: fixing repetition and truncation by tuning decoding

You own an internal LLM feature that drafts customer-facing incident updates. After a model upgrade, stakeholders report two issues: (1) outputs are often prematurely truncated, missing key details; and (2) when you try to increase "creativity," some drafts become repetitive or slightly incoherent. You are not allowed to change the model weights; only the decoding configuration is adjustable.

Write a post-incident proposal that recommends a single decoding strategy (you may combine methods) and a tuning plan that explicitly connects: (a) the choice between greedy decoding vs beam search vs sampling, (b) how you would set and justify either top-k or top-p (nucleus) sampling, (c) how you would use temperature-scaled softmax in combination with your sampling choice, and (d) how you would apply a length penalty (or length normalization) so that longer, more complete updates are not unfairly disfavored.
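To ground points (b) and (c), here is a minimal sketch (illustrative only; the logits, temperature, and p values are assumptions, not recommendations) of how temperature-scaled softmax composes with top-p (nucleus) filtering:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scale logits, then softmax.
    T < 1 sharpens the distribution; T > 1 flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose
    cumulative mass reaches p, then renormalize over that set.
    Returns (kept_indices, renormalized_probs)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return kept, [probs[i] / total for i in kept]

# Hypothetical logits for five candidate tokens.
logits = [4.0, 2.0, 1.0, 0.5, 0.1]
cool = softmax_with_temperature(logits, temperature=0.7)
warm = softmax_with_temperature(logits, temperature=1.5)
kept_cool, _ = top_p_filter(cool, p=0.9)
kept_warm, _ = top_p_filter(warm, p=0.9)
```

With these numbers the cooler distribution concentrates almost all mass on one token, so the nucleus keeps a single candidate, while the warmer distribution spreads mass out and the same p admits four candidates: the effective truncation point of top-p moves with temperature, which is exactly the interaction the proposal must address.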

Your answer must explain the tradeoffs and interactions among these controls (e.g., how temperature changes the effective candidate distribution before top-k/top-p truncation, and how length penalty changes which sequences win under beam search), and it must end with a concrete “default” configuration plus a brief rollback/monitoring plan (what metrics or failure modes you would watch for).
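For point (d), a small sketch (assumed token counts and log-probabilities, purely illustrative) of how length normalization changes which candidate wins a beam-search ranking:

```python
def normalized_score(total_log_prob, length, alpha=1.0):
    """Length-normalized sequence score: logP / length**alpha.
    alpha = 0 reproduces raw log-probability (which favors short
    sequences, since every added token makes logP more negative);
    alpha = 1 ranks by average per-token log-probability."""
    return total_log_prob / (length ** alpha)

# Two hypothetical beam candidates for an incident update.
short = (3, -6.0)   # 3 tokens, total log-prob -6.0 (avg -2.0 / token)
long_ = (8, -12.0)  # 8 tokens, total log-prob -12.0 (avg -1.5 / token)

def best(alpha):
    cands = [("short", *short), ("long", *long_)]
    return max(cands, key=lambda c: normalized_score(c[2], c[1], alpha))[0]
```

Under raw scoring (`alpha=0`) the short draft wins despite its worse per-token quality; with full normalization (`alpha=1`) the longer, more complete draft wins. Tuning `alpha` between the two is the knob the proposal should justify.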

Updated 2026-02-06

Tags

Ch.5 Inference - Foundations of Large Language Models


Computing Sciences

Data Science
