Case Study

Investigating a Production Scoring Bug: Softmax Normalization vs. Autoregressive Sequence Log-Probability

You are reviewing a production LLM feature that ranks two candidate continuations for the same user prompt $\mathbf{x}$ by computing a score for each full continuation $\mathbf{y}$ and choosing the higher-scoring one. The intended scoring rule is to choose $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \log \Pr(\mathbf{y}\mid\mathbf{x})$, using the autoregressive decomposition $\log \Pr(\mathbf{y}\mid\mathbf{x}) = \sum_{i=1}^{n} \log \Pr(y_i\mid\mathbf{x}, \mathbf{y}_{<i})$. The model produces next-token logits at each step, which should be converted to probabilities with softmax over the full vocabulary.
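As a point of reference, the intended scoring rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the production scorer: the function names, the input layout (one full-vocabulary logit list per step), and the toy numbers are all assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Return log-probabilities over the full logit list (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(u - m) for u in logits))
    return [u - log_z for u in logits]

def sequence_log_prob(step_logits, token_ids):
    """Sum log Pr(y_i | x, y_<i) over all steps of one continuation.

    step_logits[i]: full-vocabulary logits at step i (hypothetical layout).
    token_ids[i]:   vocabulary index of the candidate token y_i.
    """
    if len(step_logits) != len(token_ids):
        raise ValueError("need exactly one logit vector per token")
    return sum(log_softmax(u)[y] for u, y in zip(step_logits, token_ids))

# Toy example: 4-token vocabulary, 2-token continuation (made-up numbers).
toy_logits = [[2.0, 1.0, 0.0, -1.0], [0.5, 0.5, 3.0, 0.0]]
toy_tokens = [0, 2]
score = sequence_log_prob(toy_logits, toy_tokens)
print(score)
```

Because every step is normalized over the same full vocabulary, each term is a genuine log-probability (at most 0), and scores for two candidates under the same prompt are directly comparable.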

A teammate implemented the scorer as follows: at each step $i$, they take the logit $u_{y_i}$ for the candidate’s next token and subtract $\log\big(\sum_{t \in S_i} \exp(u_t)\big)$, where $S_i$ is a small, request-specific shortlist of 50 tokens (not the full vocabulary). They then sum these per-token values across the continuation to get the sequence score.
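The shortlist-based computation described above can be sketched next to the full-vocabulary version to see how the two differ at a single step. This is an illustrative sketch under stated assumptions: the helper names, the 5-token toy vocabulary, and the two shortlists are hypothetical, and each shortlist is assumed to be a subset of the vocabulary that contains the candidate token.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def shortlist_step_score(logits, token_id, shortlist):
    """Shortlist-style score: u_{y_i} - log sum_{t in S_i} exp(u_t)."""
    return logits[token_id] - log_sum_exp([logits[t] for t in shortlist])

def full_step_log_prob(logits, token_id):
    """True log Pr(y_i | x, y_<i): normalize over the whole vocabulary."""
    return logits[token_id] - log_sum_exp(logits)

# Same logits, same candidate token, two different shortlists (made-up numbers).
logits = [2.0, 1.0, 0.0, -1.0, 1.5]
token_id = 0
shortlist_a = [0, 3]      # captures little of the probability mass
shortlist_b = [0, 1, 4]   # captures most of the probability mass

true_lp = full_step_log_prob(logits, token_id)
gap_a = shortlist_step_score(logits, token_id, shortlist_a) - true_lp
gap_b = shortlist_step_score(logits, token_id, shortlist_b) - true_lp
print(gap_a, gap_b)
```

In this sketch the gap at each step equals the log of the full-vocabulary normalizer minus the log of the shortlist normalizer, so it depends on which tokens the shortlist happens to contain. Since that offset can change from step to step and from candidate to candidate, the summed shortlist scores do not differ from the true sequence log-probabilities by a shared constant.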

In an A/B test, the system starts preferring verbose, low-quality continuations that repeat common tokens, even when the model’s raw logits for the “good” continuation look higher at several positions.

As the on-call ML engineer, analyze whether the teammate’s scoring method is mathematically consistent with maximizing $\log \Pr(\mathbf{y}\mid\mathbf{x})$ under an autoregressive language model. In your answer, explain (1) how softmax normalization affects the conditional next-token probability $\Pr(y_i\mid\mathbf{x},\mathbf{y}_{<i})$, (2) why using a changing shortlist $S_i$ can distort the summed log-likelihood comparison between two full sequences, and (3) what concrete change you would make to the scoring computation to correctly rank candidates by $\log \Pr(\mathbf{y}\mid\mathbf{x})$.

Updated 2026-02-06

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.1 Pre-training - Foundations of Large Language Models

Ch.5 Inference - Foundations of Large Language Models

Data Science
