A language model is generating a response to the input 'New York is a'. At the first step, the token 'city' has a higher probability than the token 'big'. However, the globally optimal two-word completion is found to be 'big apple'. Explain, using the mathematical objective of inference, how it is possible for a sequence starting with a less probable word ('big') to ultimately have a higher total log-probability than a sequence starting with a more probable word ('city').


Inference in a Large Language Model is defined mathematically as finding the most probable output sequence $$\mathbf{y}$$ given an input context $$\mathbf{x}$$, that is, the sequence $$\hat{\mathbf{y}}$$ that maximizes the conditional log-probability: $$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \log \Pr(\mathbf{y} | \mathbf{x})$$. Because text is generated one token at a time, this objective decomposes into a sum of log-probabilities, one for each output token $$y_i$$. The sum covers only the output tokens, which begin at position $$m+1$$ of the full sequence, rather than starting from position $$0$$. Each token's probability is conditioned on the input context ($$x_0,\ldots,x_m$$) and all previously generated tokens ($$y_1,\ldots,y_{i-1}$$): $$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \sum_{i=1}^{n} \log \Pr(y_i|x_0,\ldots,x_m,y_1,\ldots,y_{i-1})$$.
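The 'New York is a' scenario can be made concrete with a toy model. The sketch below scores every two-token completion and compares the result with greedy decoding. All probabilities are invented for illustration, and the names `step1`, `step2`, and `sequence_log_prob` are hypothetical.

```python
import math

# Hypothetical conditional probabilities (invented for illustration).
# Step 1: Pr(y1 | "New York is a")
step1 = {"city": 0.6, "big": 0.4}
# Step 2: Pr(y2 | "New York is a", y1)
step2 = {
    "city": {"in": 0.3, "that": 0.3, "of": 0.4},
    "big": {"apple": 0.9, "city": 0.1},
}

def sequence_log_prob(y1, y2):
    """log Pr(y1, y2 | x) = log Pr(y1 | x) + log Pr(y2 | x, y1)."""
    return math.log(step1[y1]) + math.log(step2[y1][y2])

# Exhaustive search over all two-token sequences (the argmax objective).
candidates = [(y1, y2) for y1 in step1 for y2 in step2[y1]]
best = max(candidates, key=lambda y: sequence_log_prob(*y))

# Greedy decoding commits to the locally most probable token at each step.
greedy_y1 = max(step1, key=step1.get)
greedy_y2 = max(step2[greedy_y1], key=step2[greedy_y1].get)
greedy = (greedy_y1, greedy_y2)

print("argmax over sequences:", best, sequence_log_prob(*best))
print("greedy decoding:      ", greedy, sequence_log_prob(*greedy))
```

With these numbers, 'city' wins the first step, but no continuation of 'city' is very likely, so its best total falls below that of 'big apple', whose second token is almost certain once 'big' has been chosen.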

Mathematical Formulation of LLM Inference

In sequence-to-sequence models, the probability of generating a specific output token is conditioned on both the entire input sequence and all previously generated output tokens. This is represented by the formula $\Pr(y_i | x_1, \ldots, x_m, y_1, \ldots, y_{i-1})$, where $y_i$ is the current output token, $(x_1, \ldots, x_m)$ is the complete input sequence, and $(y_1, \ldots, y_{i-1})$ are the output tokens already generated. This conditional probability is the core calculation performed at each step of the auto-regressive generation process.

Conditional Probability in Sequence-to-Sequence Generation

In an autoregressive decoder, the probability for the next token is calculated at each step $i$ by conditioning on the input $x$ and all previously generated tokens $y_{<i}$. The process begins by concatenating $x$ and $y_{<i}$ and passing them through an embedding layer. This sequence of embeddings is then processed by a stack of decoder layers (which typically include self-attention and feed-forward networks) to produce a sequence of hidden states, $H$. A final linear transformation (using an output weight matrix $W^o$) is applied to these hidden states to get logits, followed by a Softmax function. The probability distribution for the next token, $y_i$, is the probability vector taken from the final position of the output sequence. The formula is: $\Pr(\cdot|x, y_{<i}) = (\text{Softmax}(H W^o))_{\text{last}}$ where $H = \text{Decoder}([x, y_{<i}])$.
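A minimal sketch of the final projection and Softmax step, in plain Python. The hidden states `H` and the output matrix `W_o` are tiny invented values standing in for a real decoder's output; the helper names `softmax` and `matmul` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable Softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(u - m) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain-Python matrix product of a (n x d) and b (d x v)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Hypothetical hidden states H = Decoder([x, y_<i]): 3 positions, hidden size 2.
H = [[0.1, 0.3], [0.5, -0.2], [1.0, 2.0]]
# Hypothetical output projection W^o: hidden size 2 -> vocabulary size 3.
W_o = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]

logits = matmul(H, W_o)                   # one row of logits per position
probs = [softmax(row) for row in logits]  # Softmax applied row by row
next_token_dist = probs[-1]               # keep only the last position

print(next_token_dist)
```

Only the last row is used for prediction because it is the only position whose context contains all of $[x, y_{<i}]$.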

Next-Token Probability Calculation in Autoregressive Decoders

This example illustrates how an autoregressive model generates the sentence 'cats are playful.' by following a specific path through a search space (e.g., node 0 → 3 → 9 → 11 → 17). The overall probability of this generated sequence is calculated by summing the conditional log-probabilities of each token. The calculation unfolds sequentially as follows:
*   `log Pr("cats"|x)`
*   `log Pr("are"|x, "cats")`
*   `log Pr("playful"|x, "cats are")`
*   `log Pr("."|x, "cats are playful")`
Each term represents the log-probability of generating the current token, given the input `x` and all previously generated tokens in the sequence.
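The four terms above can be summed as in the following sketch. The per-token probabilities are invented for illustration, since the example does not give numeric values.

```python
import math

# Hypothetical conditional probabilities for each token on the path
# (invented numbers; the source example does not specify them).
steps = [
    ("cats", 0.20),     # Pr("cats" | x)
    ("are", 0.50),      # Pr("are" | x, "cats")
    ("playful", 0.10),  # Pr("playful" | x, "cats are")
    (".", 0.80),        # Pr("." | x, "cats are playful")
]

token_log_probs = [math.log(p) for _, p in steps]
sequence_log_prob = sum(token_log_probs)

# The sum of logs equals the log of the product of the probabilities.
product = math.prod(p for _, p in steps)

print(sequence_log_prob, math.log(product))
```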

Example of Autoregressive Generation and Log-Probability Calculation

An auto-regressive language model is generating text following the input 'The cat sat on the'. The model's objective is to find the output sequence with the highest total log-probability. It is considering two possible two-word continuations:

Path A: 'warm mat'
- log Pr('warm' | 'The cat sat on the') = -0.9
- log Pr('mat' | 'The cat sat on the warm') = -1.5

Path B: 'plush rug'
- log Pr('plush' | 'The cat sat on the') = -1.2
- log Pr('rug' | 'The cat sat on the plush') = -1.1

Based on the provided conditional log-probabilities, which path will the model choose and why?
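The comparison reduces to adding each path's two log-probabilities, as in this short sketch that uses the values given above:

```python
# Conditional log-probabilities taken from the two paths above.
path_a = [-0.9, -1.5]  # 'warm', then 'mat'
path_b = [-1.2, -1.1]  # 'plush', then 'rug'

score_a = sum(path_a)  # about -2.4
score_b = sum(path_b)  # about -2.3

# The objective selects the path with the higher (less negative) total.
chosen = "B" if score_b > score_a else "A"
print(f"Path A: {score_a:.1f}, Path B: {score_b:.1f}, chosen: Path {chosen}")
```

Path B has the higher total even though 'warm' has the higher log-probability at the first step, because the objective ranks whole sequences, not individual steps.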

Analyze the following scenario and explain the model's behavior based on its core mathematical objective for generating text.

Debugging a Generation Model's Choice

Greedy Decoding vs. Optimal Sequence Probability

You are reviewing an internal incident report: a product team claims their LLM “should have generated” a particular 3-token continuation y = (y1, y2, y3) after a prompt x because, at each step, the model assigned the highest next-token probability to the token that appears in that continuation. Another team counters that the correct inference objective is to choose the continuation that maximizes the conditional probability of the entire sequence given x, and that this can disagree with stepwise top-1 choices.

Write an analysis that (1) states the mathematical inference objective for selecting an output sequence given x, (2) decomposes that objective autoregressively into next-token conditional probabilities, (3) explains how the model obtains each next-token probability from logits using softmax, and (4) connects this to the training objective by explaining how maximizing log-likelihood over data relates to (but does not guarantee) greedy stepwise selection at inference. Your answer should explicitly use log-probabilities (sum of logs) to justify why “highest at each step” is not the same claim as “highest total sequence probability,” and should include at least one concrete numeric mini-example (you may invent numbers) showing how two different 3-token continuations can lead to this disagreement.

Reconciling Training Log-Likelihood with Inference-Time Sequence Selection

You are reviewing an internal evaluation script for an autoregressive LLM used to draft customer-support replies. The script is supposed to (a) compute the total conditional log-probability of a candidate reply y given a prompt x, and (b) explain why the model preferred one reply over another.

A teammate reports a suspicious pattern: for many prompts, the script claims the model assigns extremely high probability to the next token (often >0.99) and therefore very high total sequence probability, even when the chosen reply is clearly worse. You inspect the code and find two implementation choices:

1) At each position i, it takes the model’s logits vector u^(i) over the vocabulary and converts it to “probabilities” by dividing each logit by the sum of logits (i.e., p_k = u_k / sum_j u_j), without exponentiating.
2) To score a full candidate reply y = (y_1,...,y_n), it multiplies the per-token probabilities across positions to get Pr(y|x), and then takes log at the end (i.e., log(∏_i Pr(y_i|x,y_<i))).

Write an analysis explaining, in a way that a software engineer could act on, why these choices can produce misleading next-token probabilities and incorrect sequence comparisons. Your answer must:
- Use the correct mathematical objective for inference-time sequence selection (argmax over log Pr(y|x)) and its autoregressive decomposition.
- Explain the role of softmax in turning logits into a valid conditional next-token distribution Pr(y_i|x,y_<i), and what goes wrong when you “normalize logits” directly.
- Connect the per-token conditional probabilities to the log-likelihood-style sum used for stable sequence scoring, and explain why the sum of log-probabilities is the standard computation.
- Propose a corrected scoring approach (in words and/or formulas) that would let the team reliably compare two candidate replies of different lengths without numerical issues.
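Both implementation choices can be reproduced in a few lines. The sketch below uses invented logits and an invented 2000-token reply; `naive_normalize` mimics choice 1, and the product-then-log computation mimics choice 2.

```python
import math

def naive_normalize(logits):
    """Choice 1 from the script: divide by the sum of logits (not a Softmax)."""
    total = sum(logits)
    return [u / total for u in logits]

def log_softmax(logits):
    """Stable log-Softmax: u_k - logsumexp(u)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(u - m) for u in logits))
    return [u - lse for u in logits]

logits = [2.0, -1.0, -0.5]        # hypothetical logits; they sum to 0.5
naive = naive_normalize(logits)   # [4.0, -2.0, -1.0]: not probabilities
correct = [math.exp(v) for v in log_softmax(logits)]

# Choice 2: multiplying many per-token probabilities underflows to 0.0,
# so taking the log afterwards fails; summing logs stays finite.
per_token_probs = [0.5] * 2000    # hypothetical long reply
product = math.prod(per_token_probs)
log_sum = sum(math.log(p) for p in per_token_probs)

print(naive, correct, product, log_sum)
```

Dividing by the sum of logits can yield values above 1 or below 0, because logits may be negative or sum to a small number. For replies of different lengths, the mean per-token log-probability (`log_sum / len(per_token_probs)`) is one common way to compare scores, though it is a design choice and not the only option.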

Diagnosing a “High-Confidence Wrong Token” Bug in Autoregressive Scoring

You are reviewing an internal demo of an autoregressive LLM used to draft short customer-support replies. For a given prompt x, the model must generate exactly two tokens y1 y2 (then stop). The engineer shows you the model’s final-layer logits (before Softmax) for the next token at step 1, and then (depending on the chosen y1) the logits for step 2:

Step 1 logits over the vocabulary {A, B}: u(A)=0, u(B)=0.

If y1=A, then Step 2 logits over {C, D}: u(C)=10, u(D)=0.
If y1=B, then Step 2 logits over {C, D}: u(C)=1, u(D)=1.

The engineer claims: “Because A and B are equally likely at step 1, greedy decoding is fine; it will pick either A or B, and then we’ll get the best overall two-token completion anyway.”

Write an analysis that (1) computes the relevant next-token probabilities using Softmax at each step, (2) uses the autoregressive decomposition to compute and compare the total conditional log-probability log Pr(y1,y2|x) for the best completion under y1=A versus under y1=B, and (3) explains—using the log-likelihood/sequence-scoring perspective—whether the engineer’s claim is correct and why. Your answer should make clear how per-step conditional probabilities interact to determine the best overall sequence under the argmax objective.
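The numbers in this scenario can be checked directly. The sketch below applies a log-Softmax to the logits given above and scores all four two-token completions; the helper name `log_softmax` is hypothetical.

```python
import math

def log_softmax(logits):
    """Map {token: logit} to {token: log-probability}."""
    m = max(logits.values())
    lse = m + math.log(sum(math.exp(u - m) for u in logits.values()))
    return {tok: u - lse for tok, u in logits.items()}

# Logits from the scenario above.
step1 = log_softmax({"A": 0.0, "B": 0.0})
step2 = {
    "A": log_softmax({"C": 10.0, "D": 0.0}),
    "B": log_softmax({"C": 1.0, "D": 1.0}),
}

# log Pr(y1, y2 | x) = log Pr(y1 | x) + log Pr(y2 | x, y1)
scores = {
    (y1, y2): step1[y1] + step2[y1][y2]
    for y1 in step1
    for y2 in step2[y1]
}
best = max(scores, key=scores.get)

for seq, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(seq, round(s, 4))
```

Step 1 is an exact tie (each token has probability 0.5), but the best completion after A scores about -0.693 while the best after B scores about -1.386. A greedy decoder that breaks the tie toward B therefore cannot reach the best overall sequence.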

Explaining a Counterintuitive Decoding Outcome Using Softmax, Next-Token Conditionals, and Sequence Log-Probability

You are reviewing an internal evaluation report for a customer-support LLM. The report claims the model would prefer Completion A over Completion B for the same prompt because “A has higher probability.” You suspect the analyst mixed up logits, probabilities, and sequence scoring.

Using ONLY the information below, determine which completion is actually more likely under the model (i.e., has higher conditional log-probability given the prompt), and briefly explain the reasoning steps you used (including how softmax, next-token conditional probabilities, and autoregressive decomposition combine into a single sequence score).

Auditing a Candidate Completion Using Softmax Next-Token Probabilities and Autoregressive Log-Probability

You are reviewing a production LLM feature that ranks two candidate continuations for the same user prompt x by computing a score for each full continuation y and choosing the higher-scoring one. The intended scoring rule is to choose $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \log \Pr(\mathbf{y}\mid\mathbf{x})$, using the autoregressive decomposition $\log \Pr(\mathbf{y}\mid\mathbf{x}) = \sum_{i=1}^{n} \log \Pr(y_i\mid\mathbf{x}, \mathbf{y}_{<i})$. The model produces next-token logits at each step, which should be converted to probabilities with softmax over the full vocabulary.

A teammate implemented the scorer as follows: at each step i, they take the logit for the candidate’s next token $u_{y_i}$ and subtract $\log\big(\sum_{t \in S_i} \exp(u_t)\big)$, where $S_i$ is a small, request-specific shortlist of 50 tokens (not the full vocabulary). They then sum these per-token values across the continuation to get the sequence score.

In an A/B test, the system starts preferring verbose, low-quality continuations that repeat common tokens, even when the model’s raw logits for the “good” continuation look higher at several positions.

As the on-call ML engineer, analyze whether the teammate’s scoring method is mathematically consistent with maximizing $\log \Pr(\mathbf{y}\mid\mathbf{x})$ under an autoregressive language model. In your answer, explain (1) how softmax normalization affects the conditional next-token probability $\Pr(y_i\mid\mathbf{x},\mathbf{y}_{<i})$, (2) why using a changing shortlist $S_i$ can distort the summed log-likelihood comparison between two full sequences, and (3) what concrete change you would make to the scoring computation to correctly rank candidates by $\log \Pr(\mathbf{y}\mid\mathbf{x})$.
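The effect of a shortlist denominator can be seen on a single step. In the sketch below, the five-token vocabulary, its logits, and the two shortlists are all invented for illustration.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Hypothetical next-token logits over a tiny "full" vocabulary.
logits = {"the": 3.0, "refund": 2.5, "is": 1.0, "processed": 0.5, "um": -1.0}

def full_log_prob(token):
    """Correct score: normalize over the full vocabulary."""
    return logits[token] - logsumexp(list(logits.values()))

def shortlist_log_prob(token, shortlist):
    """Teammate's score: normalize over a request-specific shortlist only."""
    return logits[token] - logsumexp([logits[t] for t in shortlist])

full = full_log_prob("refund")
short_a = shortlist_log_prob("refund", ["refund", "um"])
short_b = shortlist_log_prob("refund", ["refund", "the", "is"])

print(full, short_a, short_b)
```

The same token with the same logits receives three different scores. When the shortlist is a subset of the vocabulary, its denominator is never larger than the full one, so every step is inflated by an amount that depends on which tokens were shortlisted; summed over a sequence, these offsets no longer cancel between candidates.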

Investigating a Production Scoring Bug: Softmax Normalization vs. Autoregressive Sequence Log-Probability

You are reviewing an internal evaluation script for an autoregressive LLM used to rank two candidate completions for the same prompt x. The script is supposed to choose the completion y that maximizes the conditional log-probability log Pr(y|x), computed as a sum of next-token log-probabilities. However, the script’s author claims they can compare candidates by summing the *raw logits* (pre-softmax scores) of the chosen tokens at each position, because “softmax is monotonic so it won’t change the ranking.”

In one example, the model produces the following logits over a 3-token vocabulary {A, B, C} at each generation step (higher logit = higher score). Candidate 1 is y^(1) = [A, A]; Candidate 2 is y^(2) = [B, B].

Step 1 logits given x:
- u(A)=10, u(B)=9, u(C)=0

Step 2 logits given x and the first generated token:
- if the first token was A: u(A)=0, u(B)=0, u(C)=0
- if the first token was B: u(A)=8, u(B)=7, u(C)=0

The script’s current scoring method sums the selected-token logits across steps (e.g., score(y)=u(y1)+u(y2) using the appropriate conditional logits at step 2).

As the reviewer, determine which candidate should be selected under the *correct* inference objective, and explain why the “sum of logits” method can produce a different ranking in this case. Your explanation must explicitly connect (1) autoregressive decomposition into next-token conditionals, (2) softmax’s role in turning logits into probabilities, and (3) why training/inference use log-likelihood (log-probability) rather than raw logits for sequence scoring.
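Both scoring rules can be evaluated on the logits given above. This is a sketch with hypothetical helper names (`log_softmax`, `logit_sum`, `log_prob`), not the script under review.

```python
import math

def log_softmax(logits):
    """Map {token: logit} to {token: log-probability}."""
    m = max(logits.values())
    lse = m + math.log(sum(math.exp(u - m) for u in logits.values()))
    return {tok: u - lse for tok, u in logits.items()}

# Logits from the example above.
step1 = {"A": 10.0, "B": 9.0, "C": 0.0}
step2 = {
    "A": {"A": 0.0, "B": 0.0, "C": 0.0},
    "B": {"A": 8.0, "B": 7.0, "C": 0.0},
}

def logit_sum(y):
    """The script's current rule: sum of raw selected-token logits."""
    return step1[y[0]] + step2[y[0]][y[1]]

def log_prob(y):
    """Correct rule: sum of next-token log-probabilities."""
    return log_softmax(step1)[y[0]] + log_softmax(step2[y[0]])[y[1]]

cand1, cand2 = ("A", "A"), ("B", "B")
print("logit sums:", logit_sum(cand1), logit_sum(cand2))
print("log-probs: ", round(log_prob(cand1), 4), round(log_prob(cand2), 4))
```

The logit sums are 10 and 16, which favors Candidate 2, while the log-probabilities are about -1.41 and -2.63, which favors Candidate 1. Softmax is monotonic within one step, but its normalizer differs from step to step and from context to context, so raw logits are not comparable across positions.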

Root-Cause Analysis: Why a “More Likely” Token-by-Token Completion Loses on Total Sequence Score

You are building an internal evaluation service that ranks multiple candidate completions produced by an autoregressive LLM for the same prompt. The model returns, for each generation step i, a vector of logits u_i over the vocabulary V (one logit per token) computed from the context (prompt x plus previously generated tokens y_<i). Your service must (1) compute the conditional probability of each chosen token y_i from u_i, (2) compute the total conditional log-probability log Pr(y|x) for the full candidate sequence y = (y_1,...,y_n) using the autoregressive decomposition, and (3) return the best candidate y_hat = argmax_y log Pr(y|x). 

Create a precise, implementation-ready specification (math + clear pseudocode) for a function score_and_select(prompt_tokens x, candidates Y, logits_by_candidate U) that returns (best_candidate, scores). Your spec must explicitly show: how Softmax converts logits to next-token probabilities; how you extract Pr(y_i|x,y_<i) for the actually generated token at each step; how you aggregate across steps into a single sequence score consistent with the inference objective; and how this relates to the log-likelihood objective used in training (i.e., what quantity training maximizes that your scorer is computing at inference). Assume <BOS> is a fixed start token with probability 1 and candidates may have different lengths; include how you handle length in the score (e.g., stop at <EOS> if present).
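One possible shape for such a function is sketched below. It assumes each candidate is a list of token strings, each entry of `logits_by_candidate` is a list of `{token: logit}` dictionaries (one per step, already conditioned on the prompt and earlier tokens), and `<EOS>` is spelled exactly as shown; these representations are assumptions, not part of the task statement.

```python
import math

EOS = "<EOS>"  # assumed spelling of the end-of-sequence token

def log_softmax(logits):
    """Map {token: logit} to {token: log Pr(token | context)}."""
    m = max(logits.values())
    lse = m + math.log(sum(math.exp(u - m) for u in logits.values()))
    return {tok: u - lse for tok, u in logits.items()}

def sequence_log_prob(candidate, step_logits):
    """Sum of log Pr(y_i | x, y_<i), stopping after <EOS> if present."""
    total = 0.0
    for token, logits in zip(candidate, step_logits):
        total += log_softmax(logits)[token]  # pick the generated token
        if token == EOS:
            break
    return total

def score_and_select(prompt_tokens, candidates, logits_by_candidate):
    """Return (best_candidate, scores) under argmax_y log Pr(y | x).

    prompt_tokens is unused here because the supplied logits are assumed
    to be already conditioned on the prompt; <BOS> contributes log 1 = 0.
    """
    scores = [
        sequence_log_prob(y, u)
        for y, u in zip(candidates, logits_by_candidate)
    ]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores
```

Each score is the same per-sequence quantity, $\log \Pr(\mathbf{y}|\mathbf{x})$, that log-likelihood training sums over the training data. The sketch compares raw totals, so shorter candidates are not penalized for stopping early; whether to add length normalization is left as a design decision.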

Design a Correct Sequence-Scoring Function for Autoregressive LLM Outputs

Your team is building an internal tool that ranks ...

You’re reviewing an internal evaluation script tha...

You’re reviewing an internal LLM evaluation pipeli...

In common implementations of Large Language Models (LLMs), the log-probability of the input sequence does not need to be computed. Instead, the model directly computes the conditional log-probability of the output sequence given the input. This is done by summing the log-probabilities of each individual output token. The formula is:

$$ \log \Pr(\mathbf{y}|\mathbf{x}) = \sum_{i=1}^{n} \log \Pr(y_i|\mathbf{x},\mathbf{y}_{<i}) $$

In this notation, $$[\mathbf{x},\mathbf{y}_{<i}]$$ represents the context used for predicting the token $$y_i$$. Furthermore, the expression $$\Pr(y_i|\mathbf{x},\mathbf{y}_{<i})$$ is a common literature shorthand used to denote $$\Pr(y_i|[\mathbf{x},\mathbf{y}_{<i}])$$.
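The reason the input term can be skipped follows from the chain rule of probability, written here in the same notation:

$$ \log \Pr(\mathbf{x},\mathbf{y}) = \log \Pr(\mathbf{x}) + \log \Pr(\mathbf{y}|\mathbf{x}) = \log \Pr(\mathbf{x}) + \sum_{i=1}^{n} \log \Pr(y_i|\mathbf{x},\mathbf{y}_{<i}) $$

The term $$\log \Pr(\mathbf{x})$$ is the same for every candidate output $$\mathbf{y}$$, so it does not affect which output maximizes the score and only the sum over output positions needs to be evaluated.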
