Essay

Explaining a Counterintuitive Decoding Outcome Using Softmax, Next-Token Conditionals, and Sequence Log-Probability

You are reviewing an internal demo of an autoregressive LLM used to draft short customer-support replies. For a given prompt x, the model must generate exactly two tokens y1 y2 (then stop). The engineer shows you the model’s final-layer logits (before Softmax) for the next token at step 1, and then (depending on the chosen y1) the logits for step 2:

Step 1 logits over the vocabulary {A, B}: u(A)=0, u(B)=0.

If y1=A, then Step 2 logits over {C, D}: u(C)=10, u(D)=0.

If y1=B, then Step 2 logits over {C, D}: u(C)=1, u(D)=1.
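For reference, the per-step conditionals follow directly from the softmax definition, Pr(token) proportional to e^(logit); a worked computation from the logits above (values rounded):

```latex
\begin{align*}
\Pr(A \mid x) &= \frac{e^{0}}{e^{0}+e^{0}} = 0.5,
  & \Pr(B \mid x) &= 0.5,\\[2pt]
\Pr(C \mid x, A) &= \frac{e^{10}}{e^{10}+e^{0}} \approx 0.99995,
  & \Pr(D \mid x, A) &= \frac{e^{0}}{e^{10}+e^{0}} \approx 4.5\times 10^{-5},\\[2pt]
\Pr(C \mid x, B) &= \frac{e^{1}}{e^{1}+e^{1}} = 0.5,
  & \Pr(D \mid x, B) &= 0.5.
\end{align*}
```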

The engineer claims: “Because A and B are equally likely at step 1, greedy decoding is fine; it will pick either A or B, and then we’ll get the best overall two-token completion anyway.”

Write an analysis that (1) computes the relevant next-token probabilities using Softmax at each step, (2) uses the autoregressive decomposition to compute and compare the total conditional log-probability log Pr(y1,y2|x) for the best completion under y1=A versus under y1=B, and (3) explains, from the log-likelihood/sequence-scoring perspective, whether the engineer's claim is correct and why. Your answer should make clear how per-step conditional probabilities interact to determine the best overall sequence under the argmax objective.
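To check the arithmetic, here is a minimal, self-contained sketch in plain Python. No model is involved; the dictionaries simply encode the logits given above, and names such as step1_logits and step2_logits are illustrative, not part of any real API:

```python
import math
from itertools import product

def softmax(logits):
    """Map a dict of logits to a dict of probabilities: p(t) = exp(u_t) / sum of exp(u)."""
    z = sum(math.exp(u) for u in logits.values())
    return {tok: math.exp(u) / z for tok, u in logits.items()}

# Logits from the problem statement (variable names are illustrative).
step1_logits = {"A": 0.0, "B": 0.0}
step2_logits = {
    "A": {"C": 10.0, "D": 0.0},  # step-2 logits given y1 = A
    "B": {"C": 1.0, "D": 1.0},   # step-2 logits given y1 = B
}

p1 = softmax(step1_logits)
p2 = {y1: softmax(u) for y1, u in step2_logits.items()}

# Autoregressive decomposition:
#   log Pr(y1, y2 | x) = log Pr(y1 | x) + log Pr(y2 | x, y1)
scores = {
    (y1, y2): math.log(p1[y1]) + math.log(p2[y1][y2])
    for y1, y2 in product(p1, ["C", "D"])
}

# Rank all four two-token sequences by total log-probability.
for (y1, y2), lp in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{y1}{y2}: log Pr = {lp:.4f}, Pr = {math.exp(lp):.5f}")
```

Running the sketch prints all four sequence log-probabilities in descending order, giving concrete numbers to compare when answering part (2).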


