Imagine a text generation model is producing a sequence. At the first step, the most probable next word is 'apple' with a log-probability of -0.8. The second most probable word is 'apricot' with a log-probability of -0.9. A simple greedy approach would select 'apple'. However, the best complete sequence actually starts with 'apricot'. Explain, in detail, the mechanism by which a search process that keeps track of multiple hypotheses at each step could arrive at the better overall sequence, even though it did not start with the most probable first word.

Beam search is a sequence decoding strategy that strikes a compromise between the computational efficiency of greedy search and the optimality of exhaustive search. Instead of greedily picking the single most likely token or exploring all possible paths, beam search evaluates and retains a predetermined number of the most promising candidate sequences at each step of the generation process.
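As a minimal sketch of this idea, the following toy example runs the same search routine with one retained candidate (equivalent to greedy search) and with two. The lookup table stands in for a real model and is hypothetical: only the first-step values -0.8 and -0.9 come from the apple/apricot scenario, and every second-step token and log-probability is invented for illustration.

```python
# Hypothetical toy "model": maps a prefix (tuple of tokens) to next-token
# log-probabilities. Only the first-step values follow the apple/apricot
# scenario; the second-step entries are invented for illustration.
TOY_LOGPROBS = {
    (): {"apple": -0.8, "apricot": -0.9},
    ("apple",): {"pie": -1.6, "tree": -1.8},
    ("apricot",): {"jam": -0.3, "tart": -2.0},
}


def beam_search(table, beam_size, steps):
    """Keep the `beam_size` best-scoring prefixes after every step."""
    beams = [((), 0.0)]  # (prefix, summed log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            # Extend each retained prefix with every token the table offers.
            for token, logp in table.get(prefix, {}).items():
                candidates.append((prefix + (token,), score + logp))
        if not candidates:
            break
        # Rank all extensions together, then prune to the beam size.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


greedy = beam_search(TOY_LOGPROBS, beam_size=1, steps=2)
beam = beam_search(TOY_LOGPROBS, beam_size=2, steps=2)
print("beam size 1:", greedy[0])
print("beam size 2:", beam[0])
```

With a beam size of 1 the search commits to "apple" and can only finish with an "apple ..." sequence. With a beam size of 2, "apricot" survives the first step, so its strong continuation can overtake every "apple ..." candidate when all extensions are ranked together at the second step.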

Beam Search Strategy in Sequence-to-Sequence Models

In the beam search algorithm, expanding a hypothesis starts from a parent node, which represents a given prefix sequence $$(y_1, \ldots, y_{i-1})$$, and selects the K most probable next tokens from the vocabulary. This step generates K new, longer candidate sequences to be considered in the next stage of the search.
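The expansion of a single parent can be sketched as below. The function name `expand_hypothesis`, the example tokens, and their log-probabilities are all hypothetical; a real decoder would obtain `next_logprobs` from the model's output distribution for the given prefix.

```python
import heapq


def expand_hypothesis(prefix, prefix_score, next_logprobs, k):
    """Return the k highest-scoring one-token extensions of one parent prefix."""
    # Pick the k most probable next tokens for this prefix.
    top = heapq.nlargest(k, next_logprobs.items(), key=lambda kv: kv[1])
    # Each child is the parent prefix plus one token; scores add in log space.
    return [(prefix + [token], prefix_score + logp) for token, logp in top]


# Invented values for illustration only.
children = expand_hypothesis(
    prefix=["we", "saw"],
    prefix_score=-1.0,
    next_logprobs={"a": -0.4, "the": -0.7, "some": -2.1, "no": -3.0},
    k=2,
)
for tokens, score in children:
    print(tokens, round(score, 2))
```

Only the parent's K best children are produced here; pruning across the children of all parents is a separate step.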

Top-K Token Selection in Beam Search

A text generation model is creating a sequence of words. It uses a search process that keeps track of the 2 most probable sequences at each step. The score for a sequence is the sum of the log-probabilities of its words. Given the state of the search below, which two sequences will be kept for the next step?

**Step 1:** The initial two sequences being tracked are:
*   Sequence 1: "The" (Score: -0.5)
*   Sequence 2: "A" (Score: -0.9)

**Step 2:** The model calculates the log-probabilities for the next possible words for each sequence:
*   Expanding "The":
    *   "cat": -0.8
    *   "dog": -1.1
*   Expanding "A":
    *   "mouse": -0.2
    *   "lion": -1.5
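One way to check the arithmetic for this step is to add each word's log-probability to the score of the sequence it extends, then rank all four candidates together. The short script below does this with the numbers given above; the variable names are illustrative.

```python
# Scores and expansions copied from Step 1 and Step 2 above.
beams = {"The": -0.5, "A": -0.9}
expansions = {
    "The": {"cat": -0.8, "dog": -1.1},
    "A": {"mouse": -0.2, "lion": -1.5},
}

# Candidate score = score of the tracked sequence + log-probability of the new word.
candidates = [
    (f"{prefix} {word}", score + logp)
    for prefix, score in beams.items()
    for word, logp in expansions[prefix].items()
]

# Keep the 2 best candidates overall, regardless of which sequence they extend.
kept = sorted(candidates, key=lambda c: c[1], reverse=True)[:2]
for sequence, score in kept:
    print(f"{sequence}: {score:.1f}")
```

This prints `A mouse: -1.1` followed by `The cat: -1.3`. The other two candidates, "The dog" (-1.6) and "A lion" (-2.4), are pruned. Note that "A mouse" now outranks both extensions of "The", even though "A" had the lower score after Step 1.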

Analyzing Search Algorithm Behavior

A text generation model is tasked with producing a summary. It explores several candidate summaries and calculates a score for each by summing the log-probabilities of its words. The model's goal is to output the sequence with the highest score. Review the two final candidates below and explain the fundamental flaw in this scoring method that leads the model to select the suboptimal summary. Then, describe the general principle of a technique that could correct for this flaw.
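Because every word contributes a log-probability that is at most zero, a summed score can only decrease as a sequence grows, which tends to favor shorter candidates. The sketch below illustrates one common correction, dividing the sum by the sequence length raised to a power `alpha`. The two candidates and their scores are hypothetical, since no candidates are listed in this text, and `alpha=1.0` (a plain per-word average) is an assumed setting rather than a recommended one.

```python
def length_normalized(total_logprob, length, alpha=1.0):
    """Divide the summed log-probability by length**alpha (alpha=1 is a per-word average)."""
    return total_logprob / (length ** alpha)


# Hypothetical candidates: (name, summed log-probability, number of words).
candidates = [
    ("short summary", -2.0, 4),   # average of -0.5 per word
    ("long summary", -3.0, 10),   # average of -0.3 per word
]

# Raw sum: the shorter candidate wins despite being worse per word.
best_raw = max(candidates, key=lambda c: c[1])
# Normalized score: the longer candidate wins.
best_normalized = max(candidates, key=lambda c: length_normalized(c[1], c[2]))

print("raw sum picks:", best_raw[0])
print("normalized picks:", best_normalized[0])
```

Values of `alpha` between 0 and 1 would interpolate between the raw sum and the per-word average.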

Diagnosing a Flaw in Sequence Generation

You are tuning decoding for an internal "meeting-n...

You’re deploying an LLM to draft customer-facing i...

You’re building an internal “RFP response drafter”...

You’re implementing an LLM feature that generates ...

You own an internal LLM feature that drafts customer-facing incident updates. After a model upgrade, stakeholders report two issues: (1) outputs are often prematurely short (missing key details), and (2) when you try to increase “creativity,” some drafts become repetitive or slightly incoherent. You are not allowed to change the model weights—only the decoding configuration.

Write a post-incident proposal that recommends a single decoding strategy (you may combine methods) and a tuning plan that explicitly connects: (a) the choice between greedy decoding vs beam search vs sampling, (b) how you would set and justify either top-k or top-p (nucleus) sampling, (c) how you would use temperature-scaled softmax in combination with your sampling choice, and (d) how you would apply a length penalty (or length normalization) so that longer, more complete updates are not unfairly disfavored.

Your answer must explain the tradeoffs and interactions among these controls (e.g., how temperature changes the effective candidate distribution before top-k/top-p truncation, and how length penalty changes which sequences win under beam search), and it must end with a concrete “default” configuration plus a brief rollback/monitoring plan (what metrics or failure modes you would watch for).

Post-incident analysis: fixing repetition and truncation by tuning decoding

You own the generation layer for an internal, regulated customer-support assistant. Two issues are reported after a model upgrade:

1) For short answers (target: 1–2 sentences), the assistant often produces overly long, meandering responses.
2) For longer answers (target: 6–10 sentences), the assistant is repetitive and sometimes “locks in” early to a suboptimal phrasing that later forces awkward continuations.

You are not allowed to change the model weights—only the decoding strategy and its parameters. Propose a single coherent decoding policy (you may use different settings by response-length tier, but keep the approach consistent) that addresses both issues. In your justification, explicitly explain how your choices combine: (a) a deterministic search method (greedy or beam search) versus a sampling method (top-k or top-p), (b) temperature scaling, and (c) a length penalty. Your answer must describe the tradeoffs you are making (e.g., predictability vs. diversity, local vs. global sequence quality, and how length controls interact with search/sampling) and why your policy would reduce both overlong short answers and repetitive long answers in production.

Debugging Decoding: Balancing Determinism, Diversity, and Length in a Regulated Product

You are deploying the same LLM behind two internal products:

1) A compliance assistant that drafts short, auditable policy answers where reproducibility is required (the same input should yield the same output), and answers must not become overly long.
2) A brainstorming assistant for product managers where novelty and variety are valued, but outputs must remain coherent and not drift into low-probability “nonsense.”

Write a recommendation memo that proposes a decoding configuration for each product. Your memo must:
- Choose between greedy decoding, beam search, top-k sampling, and top-p (nucleus) sampling for each product, and justify the choice in terms of determinism vs. diversity and how candidate-set pruning works.
- Specify how you would use temperature scaling in the sampling-based configuration(s) (e.g., higher/lower temperature) and explain the expected effect on the renormalized token probabilities.
- Explain whether and how you would apply a length penalty (or length normalization) in the deterministic configuration(s), including the failure mode it is intended to prevent.
- Explicitly discuss at least one tradeoff you are accepting in each product (e.g., quality vs. diversity, compute vs. optimality, brevity vs. completeness) and why it is appropriate for that product’s constraints.
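To make the interaction between temperature and top-p truncation concrete, here is a minimal standard-library sketch. It assumes temperature divides the logits before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The function names and the example logits are hypothetical and do not come from any particular library.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature (assumed convention)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature, p, rng):
    """Temperature is applied first, so it changes which tokens enter the nucleus."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, 0.0]  # invented logits for a 4-token vocabulary
cold = top_p_filter(softmax_with_temperature(logits, 0.5), p=0.9)
hot = top_p_filter(softmax_with_temperature(logits, 1.5), p=0.9)
print("tokens kept at temperature 0.5:", sorted(cold))
print("tokens kept at temperature 1.5:", sorted(hot))
print("sampled token id:", sample_token(logits, 1.5, 0.9, random.Random(0)))
```

With the same `p`, the lower temperature concentrates mass on fewer tokens and yields a smaller nucleus, while the higher temperature admits more low-probability tokens into the candidate set.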

Selecting and Justifying a Decoding Policy for Two Production Use Cases

You are deploying an internal LLM feature that drafts customer-facing email replies for account managers. Requirements: (1) average latency must stay under 250 ms, (2) outputs must not be overly “templated” across similar tickets (product wants noticeable variation), and (3) replies must be 90–130 words; current outputs are often ~50 words and sometimes end abruptly. You can change only decoding-time settings (no fine-tuning). 

Propose ONE concrete decoding configuration (algorithm + key parameters) that best meets all three requirements, and justify it by explaining how your choices jointly affect (a) determinism vs diversity, (b) search/compute cost, and (c) output length. Your answer must explicitly address why you did NOT choose at least one plausible alternative (e.g., greedy, beam search, top-k, or top-p) given the constraints.

Choosing a Decoding Configuration Under Latency, Diversity, and Length Constraints

You are the on-call ML lead for a B2B product that generates 1–2 sentence “Account Update” summaries from internal CRM notes. A pilot with 200 users shows two failure modes:

1) Some summaries are overly short (often 6–10 words) and omit key facts.
2) Other summaries are fluent but occasionally include a plausible-sounding detail that is not in the notes.

Constraints: p95 latency must stay under 250 ms; the product team wants consistent tone across runs for the same input, but they also want to avoid repetitive phrasing across different accounts. You can change only decoding-time settings (no retraining, no prompt changes).

You must choose ONE decoding approach to ship this week and specify concrete settings for: (a) the core decoding algorithm (greedy, beam search, top-k sampling, or top-p sampling), (b) whether/how you will use temperature scaling, and (c) whether/how you will apply a length penalty. In your answer, justify how your choices jointly address both failure modes and the stated constraints, including at least one tradeoff you are accepting.

Release-readiness decision: decoding configuration for a customer-facing summarization feature

You own the decoding configuration for an internal multilingual customer-support assistant that drafts replies agents can send with minimal edits. The business constraints are: (1) replies must be deterministic for the same ticket to support auditability, (2) average generation latency must stay under 250 ms, (3) replies must not be overly short (agents complain about missing key steps) but also must not ramble, and (4) the current model sometimes produces a safe but generic first sentence and then either ends too early or repeats itself.

You are given two candidate configurations to choose from for the next release:

Config A:
- Greedy decoding
- No length penalty

Config B:
- Beam search with beam width B=4
- Apply a length penalty that discourages very short completions

A third option is proposed by another team:

Config C:
- Top-p (nucleus) sampling with p=0.9
- Temperature-scaled softmax with β=1.3
- No length penalty

Case study task: Choose the single best configuration (A, B, or C) for this release and justify your choice by explicitly explaining how your selected approach handles (i) determinism vs diversity, (ii) the risk of premature stopping vs overly long outputs (including the role of length penalty), and (iii) latency/compute tradeoffs. Your justification must reference at least two concrete failure modes described above and explain why the other two configurations are less suitable under the stated constraints.

Decoding policy decision for a multilingual support assistant under safety, latency, and verbosity constraints

The computational cost of generating a sequence using beam search is bounded by $$\mathcal{O}(k\left|\mathcal{Y}\right|T')$$, where $$k$$ represents the beam size, $$\left|\mathcal{Y}\right|$$ is the size of the output vocabulary, and $$T'$$ is the maximum length of the generated sequence. This measurable cost situates beam search as an intermediate strategy, rendering it more computationally intensive than greedy search but considerably more tractable than exhaustive search.
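A rough back-of-the-envelope comparison can make this bound tangible. The sketch below counts candidate scores under the simplifying assumption that each step scores every vocabulary token for each retained sequence, and it treats greedy search as a beam of size 1. The exhaustive figure counts the distinct sequences of the maximum length. The specific sizes are invented for illustration.

```python
def beam_search_scores(beam_size, vocab_size, max_len):
    """Candidate scores evaluated by beam search: k * |Y| * T'."""
    return beam_size * vocab_size * max_len


def exhaustive_sequences(vocab_size, max_len):
    """Distinct sequences of length T' that exhaustive search would rank: |Y| ** T'."""
    return vocab_size ** max_len


vocab_size, max_len = 10_000, 20  # invented sizes for illustration

greedy_cost = beam_search_scores(1, vocab_size, max_len)
beam_cost = beam_search_scores(4, vocab_size, max_len)
exhaustive_cost = exhaustive_sequences(vocab_size, max_len)

print(f"greedy (k=1): {greedy_cost:,}")
print(f"beam   (k=4): {beam_cost:,}")
print(f"exhaustive  : about 10^{len(str(exhaustive_cost)) - 1}")
```

Under these assumptions beam search costs exactly `k` times as much as greedy search, while the exhaustive count grows exponentially with the sequence length.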

Computational Cost of Beam Search in Sequence-to-Sequence Models

The most straightforward version of beam search relies on a single hyperparameter called the beam size, denoted as $$k$$. The beam size specifies the exact number of top candidate sequences that the algorithm maintains at each generation time step, providing a flexible trade-off between the accuracy of the output and the overall computational cost.
