Essay

Diagnosing a “High-Confidence Wrong Token” Bug in Autoregressive Scoring

You are reviewing an internal evaluation script for an autoregressive LLM used to draft customer-support replies. The script is supposed to (a) compute the total conditional log-probability of a candidate reply y given a prompt x, and (b) explain why the model preferred one reply over another.

A teammate reports a suspicious pattern: for many prompts, the script claims the model assigns extremely high probability to the next token (often >0.99) and therefore a very high total sequence probability, even when the chosen reply is clearly worse. You inspect the code and find two implementation choices (both illustrated in the sketch after the list):

  1. At each position i, it takes the model’s logits vector u^(i) over the vocabulary and converts it to “probabilities” by dividing each logit by the sum of logits (i.e., p_k = u_k / sum_j u_j), without exponentiating.
  2. To score a full candidate reply y = (y_1,...,y_n), it multiplies the per-token probabilities across positions to get Pr(y|x), and then takes log at the end (i.e., log(∏_i Pr(y_i|x,y_<i))).
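
The following is a minimal numerical sketch of what both choices do in practice. The logit values, variable names, and vocabulary size are made up for illustration and are not taken from the actual script:

    import numpy as np

    # Hypothetical logits u^(i) over a 4-token vocabulary at one position.
    logits = np.array([10.2, 0.05, 0.03, 0.02])

    # Choice 1 (buggy): "normalize" raw logits by their sum, no exponentiation.
    # Whenever one positive logit dominates the sum, this yields a value near 1
    # regardless of how peaked the true distribution is -- the reported ">0.99".
    buggy = logits / logits.sum()            # ~[0.990, 0.005, 0.003, 0.002]

    # With negative logits it is not even a distribution: entries can be
    # negative or exceed 1, and the sum of logits can sit near 0.
    logits2 = np.array([3.0, -1.0, 0.5, -2.0])
    buggy2 = logits2 / logits2.sum()         # [6.0, -2.0, 1.0, -4.0]

    # The correct conversion is a softmax (exponentiate, then normalize).
    shifted = logits - logits.max()          # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()

    # Choice 2 (fragile): multiplying many per-token probabilities underflows
    # in floating point long before the log is taken at the end.
    per_token = np.full(500, 0.1)            # 500 tokens, each with probability 0.1
    print(np.log(np.prod(per_token)))        # -inf (the product underflows to 0.0)
    print(np.log(per_token).sum())           # -1151.29... (the stable computation)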

Write an analysis explaining, in a way that a software engineer could act on, why these choices can produce misleading next-token probabilities and incorrect sequence comparisons. Your answer must:

  • Use the correct mathematical objective for inference-time sequence selection (argmax over log Pr(y|x)) and its autoregressive decomposition.
  • Explain the role of softmax in turning logits into a valid conditional next-token distribution Pr(y_i|x,y_<i), and what goes wrong when you “normalize logits” directly.
  • Connect the per-token conditional probabilities to the log-likelihood-style sum used for stable sequence scoring, and explain why the sum of log-probabilities is the standard computation.
  • Propose a corrected scoring approach (in words and/or formulas) that would let the team reliably compare two candidate replies of different lengths without numerical issues (one possible sketch follows this list).
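
For reference, one possible corrected scoring routine, written as a minimal PyTorch sketch. It assumes the script can obtain the model's logits at every position of the candidate reply; the function and tensor names (sequence_log_prob, per_token_score, logits, target_ids) are illustrative, not part of any existing codebase:

    import torch
    import torch.nn.functional as F

    def sequence_log_prob(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
        """Return log Pr(y|x) = sum_i log Pr(y_i | x, y_<i).

        logits:     (n, vocab_size) -- the logits u^(i) the model emits at each position i
        target_ids: (n,)            -- the token ids y_1, ..., y_n of the candidate reply
        """
        # Softmax applied in log space gives a valid conditional distribution at each
        # position, and the per-token log-probabilities are summed rather than
        # multiplied, so long replies do not underflow.
        log_probs = F.log_softmax(logits, dim=-1)                              # (n, vocab_size)
        token_log_probs = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
        return token_log_probs.sum()

    def per_token_score(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
        # Length-normalized score (average log-probability per token), so that
        # candidate replies of different lengths can be compared fairly.
        return sequence_log_prob(logits, target_ids) / target_ids.numel()

Comparing two candidates then amounts to comparing their length-normalized scores (or their raw summed log-probabilities when lengths are equal), never products of raw probabilities.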
