Case Study

Root-Cause Analysis: Why a “More Likely” Token-by-Token Completion Loses on Total Sequence Score

You are reviewing an internal evaluation script for an autoregressive LLM used to rank two candidate completions for the same prompt x. The script is supposed to choose the completion y that maximizes the conditional log-probability log Pr(y|x), computed as a sum of next-token log-probabilities. However, the script’s author claims they can compare candidates by summing the raw logits (pre-softmax scores) of the chosen tokens at each position, because “softmax is monotonic so it won’t change the ranking.”

In one example, the model produces the following logits over a 3-token vocabulary {A, B, C} at each generation step (higher logit = higher score). Candidate 1 is y^(1) = [A, A]; Candidate 2 is y^(2) = [B, B].

Step 1 logits given x:

  • u(A)=10, u(B)=9, u(C)=0

Step 2 logits given x and the first generated token:

  • if the first token was A: u(A)=0, u(B)=0, u(C)=0
  • if the first token was B: u(A)=8, u(B)=7, u(C)=0

The script’s current scoring method sums the selected-token logits across steps, i.e., score(y) = u(y_1) + u(y_2), where y_1 and y_2 are the first and second tokens of candidate y and the step-2 logits are the ones conditioned on the step-1 token.
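To make the comparison concrete, here is a minimal numeric sketch of both rules, assuming Python with NumPy. The function names (score_sum_of_logits, score_log_prob) and data layout are illustrative, not taken from the actual script.

```python
import numpy as np

VOCAB = {"A": 0, "B": 1, "C": 2}

# Step-1 logits given the prompt x, and step-2 logits given x plus the first token.
step1 = np.array([10.0, 9.0, 0.0])
step2 = {"A": np.array([0.0, 0.0, 0.0]),   # logits after generating A
         "B": np.array([8.0, 7.0, 0.0])}   # logits after generating B

def log_softmax(logits):
    """Turn raw logits into log-probabilities (numerically stable)."""
    m = logits.max()
    return logits - (m + np.log(np.exp(logits - m).sum()))

def score_sum_of_logits(tokens):
    """The script's current rule: sum the raw logits of the chosen tokens."""
    return step1[VOCAB[tokens[0]]] + step2[tokens[0]][VOCAB[tokens[1]]]

def score_log_prob(tokens):
    """Sum of next-token log-probabilities, i.e. log Pr(y|x) under the
    autoregressive decomposition."""
    return (log_softmax(step1)[VOCAB[tokens[0]]]
            + log_softmax(step2[tokens[0]])[VOCAB[tokens[1]]])

for cand in (["A", "A"], ["B", "B"]):
    print(cand,
          "sum of logits =", round(float(score_sum_of_logits(cand)), 3),
          "log-prob =", round(float(score_log_prob(cand)), 3))
# Approximate output:
# ['A', 'A'] sum of logits = 10.0 log-prob = -1.412
# ['B', 'B'] sum of logits = 16.0 log-prob = -2.627
```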

As the reviewer, determine which candidate should be selected under the correct inference objective, and explain why the “sum of logits” method can produce a different ranking in this case. Your explanation must explicitly connect (1) autoregressive decomposition into next-token conditionals, (2) softmax’s role in turning logits into probabilities, and (3) why training/inference use log-likelihood (log-probability) rather than raw logits for sequence scoring.
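For reference, the objective the reviewer should apply is the standard autoregressive sequence score (notation follows the prompt, with u_t denoting the step-t logits conditioned on x and the preceding generated tokens):

\[
\log \Pr(y \mid x) \;=\; \sum_{t=1}^{T} \log \Pr(y_t \mid x, y_{<t})
\;=\; \sum_{t=1}^{T} \Big[\, u_t(y_t) - \log \!\!\sum_{v \in \{A,B,C\}} e^{u_t(v)} \Big].
\]

The per-step log-normalizer depends on the tokens generated so far, which is what the sum-of-logits shortcut discards.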
