Diagnosing a “High-Confidence Wrong Token” Bug in Autoregressive Scoring
You are reviewing an internal evaluation script for an autoregressive LLM used to draft customer-support replies. The script is supposed to (a) compute the total conditional log-probability of a candidate reply y given a prompt x, and (b) explain why the model preferred one reply over another.
A teammate reports a suspicious pattern: for many prompts, the script claims the model assigns extremely high probability to the next token (often >0.99) and therefore very high total sequence probability, even when the chosen reply is clearly worse. You inspect the code and find two implementation choices:
- At each position i, it takes the model’s logits vector u^(i) over the vocabulary and converts it to “probabilities” by dividing each logit by the sum of logits (i.e., p_k = u_k / sum_j u_j), without exponentiating.
- To score a full candidate reply y = (y_1, ..., y_n), it multiplies the per-token probabilities across positions to get Pr(y|x), and only then takes the log (i.e., log(∏_i Pr(y_i | x, y_<i))).
Write an analysis explaining, in a way that a software engineer could act on, why these choices can produce misleading next-token probabilities and incorrect sequence comparisons. Your answer must:
- Use the correct mathematical objective for inference-time sequence selection (argmax over log Pr(y|x)) and its autoregressive decomposition.
- Explain the role of softmax in turning logits into a valid conditional next-token distribution Pr(y_i|x,y_<i), and what goes wrong when you “normalize logits” directly.
- Connect the per-token conditional probabilities to the log-likelihood-style sum used for stable sequence scoring, and explain why the sum of log-probabilities is the standard computation.
- Propose a corrected scoring approach (in words and/or formulas) that would let the team reliably compare two candidate replies of different lengths without numerical issues.
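The corrected scoring approach the last bullet asks for can be sketched in a few lines of pure Python (illustrative only; the function names are ours, not the script's):

```python
import math

def log_softmax(logits):
    # Stable log-softmax: shift by the max logit so exp() cannot overflow,
    # then subtract the log partition function. exp(result) is a valid
    # probability distribution (positive entries summing to 1).
    m = max(logits)
    log_z = m + math.log(sum(math.exp(u - m) for u in logits))
    return [u - log_z for u in logits]

def sequence_log_prob(logits_per_step, token_ids):
    # log Pr(y | x) = sum_i log Pr(y_i | x, y_<i):
    # sum per-token conditional log-probabilities instead of multiplying
    # raw probabilities and taking log at the end, which underflows.
    return sum(log_softmax(step)[tok]
               for step, tok in zip(logits_per_step, token_ids))
```

When the two candidate replies have different lengths, the team may additionally want to compare length-normalized scores (total log-probability divided by n), since the raw sum systematically penalizes longer replies.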
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.1 Pre-training - Foundations of Large Language Models
Ch.5 Inference - Foundations of Large Language Models
Data Science
Related
Conditional Probability in Sequence-to-Sequence Generation
Next-Token Probability Calculation in Autoregressive Decoders
Example of Autoregressive Generation and Log-Probability Calculation
An auto-regressive language model is generating text following the input 'The cat sat on the'. The model's objective is to find the output sequence with the highest total log-probability. It is considering two possible two-word continuations:
Path A: 'warm mat'
- log Pr('warm' | 'The cat sat on the') = -0.9
- log Pr('mat' | 'The cat sat on the warm') = -1.5
Path B: 'plush rug'
- log Pr('plush' | 'The cat sat on the') = -1.2
- log Pr('rug' | 'The cat sat on the plush') = -1.1
Based on the provided conditional log-probabilities, which path will the model choose and why?
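The comparison the question sets up reduces to summing each path's conditional log-probabilities and taking the larger total; a quick arithmetic check (illustrative only):

```python
import math

# Per-token conditional log-probabilities given in the question.
path_a = -0.9 + (-1.5)   # log Pr('warm mat' | prompt)
path_b = -1.2 + (-1.1)   # log Pr('plush rug' | prompt)

# The model prefers the continuation with the higher total log-probability.
best = 'A' if path_a > path_b else 'B'
```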
Debugging a Generation Model's Choice
Greedy Decoding vs. Optimal Sequence Probability
Reconciling Training Log-Likelihood with Inference-Time Sequence Selection
Diagnosing a “High-Confidence Wrong Token” Bug in Autoregressive Scoring
Explaining a Counterintuitive Decoding Outcome Using Softmax, Next-Token Conditionals, and Sequence Log-Probability
Auditing a Candidate Completion Using Softmax Next-Token Probabilities and Autoregressive Log-Probability
Investigating a Production Scoring Bug: Softmax Normalization vs. Autoregressive Sequence Log-Probability
Root-Cause Analysis: Why a “More Likely” Token-by-Token Completion Loses on Total Sequence Score
Design a Correct Sequence-Scoring Function for Autoregressive LLM Outputs
Direct Computation of Output Sequence Log-Probability in LLMs
Mathematical Justification for Greedy Search
A language model needs to compute the total log-probability for generating the specific three-token sequence y = (y_1, y_2, y_3) given an input x. Based on the standard autoregressive formulation, which of the following expressions correctly represents this calculation?
Calculating Sequence Log-Probability
Analysis of Text Generation Approaches
Schematic of Probability Calculation in Causal Language Modeling
An autoregressive language model is given the sequence of tokens: 'The', 'cat', 'sat', 'on', 'the'. It is now tasked with predicting the very next token. Which of the following expressions correctly represents the primary calculation the model performs to determine the likelihood of the word 'mat' appearing next?
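The calculation the question points at can be written out as follows: the model produces a logit vector u over the vocabulary given the five-token context, applies the softmax, and reads off the entry for 'mat' (notation ours):

```latex
\Pr(\text{mat} \mid \text{The, cat, sat, on, the})
  = \mathrm{Softmax}(\mathbf{u})_{\text{mat}}
  = \frac{\exp(u_{\text{mat}})}{\sum_{k \in V} \exp(u_k)}
```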
Contextual Influence on Token Probability
Analyzing Contextual Influence on Next-Token Probability
Neural Network-Based Next-Token Probability Distribution
Initial Token Probability Assumption
Calculating Sequence Log-Likelihood
A language model is being trained on the sentence 'The cat sat'. The model calculates the following conditional log-probabilities at each step, where '<BOS>' is a fixed start-of-sequence token:
log P('The' | '<BOS>') = -1.5
log P('cat' | '<BOS>', 'The') = -0.9
log P('sat' | '<BOS>', 'The', 'cat') = -1.2
Based on the standard training objective for this single sequence, what is the total log-likelihood value that the model aims to maximize?
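The standard training objective for a single sequence is the sum of the per-step conditional log-probabilities; a quick check against the values listed above (illustrative only):

```python
import math

# Sum of the per-step conditional log-probabilities from the question.
total_log_likelihood = -1.5 + (-0.9) + (-1.2)
```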
Model Output Evaluation
Pros and Cons of Softmax Function
Softmax Regression (Activation)
Parameterized Softmax Layer
Plackett-Luce Selection Probability Formula
Conditional Probability Formula for Autoregressive Models using Softmax
A neural network's final layer produces the raw output scores (logits) [2.0, 1.0, 0.1] for three possible classes. To convert these scores into class probabilities, a function is applied that first exponentiates each score and then normalizes these new values by dividing each by their sum. What is the resulting probability distribution? (Values are rounded to three decimal places.)
A function is used to convert a vector of raw, unnormalized scores z = [z_1, z_2, ..., z_K] into a probability distribution. This function operates by first applying the standard exponential function to each score and then normalizing these new values by dividing each by their sum. If a constant value C is added to every score in the input vector z, resulting in a new vector z' = [z_1+C, z_2+C, ..., z_K+C], how will the resulting output probability distribution be affected?
Consider two input vectors of raw scores (logits) for a 3-class classification problem: Vector A = [1, 2, 3] and Vector B = [1, 5, 10]. Both vectors are passed through a function that exponentiates each score and then normalizes the results by dividing by their sum. How will the resulting probability distribution for Vector B compare to the one for Vector A?
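All three questions describe the same function, the softmax; a minimal sketch that covers each case (function name ours, illustrative only):

```python
import math

def softmax(z):
    # Exponentiate each score, then normalize so the outputs are positive
    # and sum to 1, forming a valid probability distribution.
    exps = [math.exp(v) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# First question's logits:
probs = softmax([2.0, 1.0, 0.1])
```

Evaluating `softmax` on shifted inputs (z + C) and on the sharper vector B shows the two properties the other questions probe: invariance to adding a constant, and a more peaked distribution when the logit gaps grow.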
Derivative of Softmax Cross-Entropy Loss with Respect to Logits
Numerical Overflow in Softmax Function