Case Study

Investigating a Production Scoring Bug: Softmax Normalization vs. Autoregressive Sequence Log-Probability

You are reviewing a production LLM feature that ranks two candidate continuations for the same user prompt \(\mathbf{x}\) by computing a score for each full continuation \(\mathbf{y}\) and choosing the higher-scoring one. The intended scoring rule is \(\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \log \Pr(\mathbf{y}\mid\mathbf{x})\), using the autoregressive decomposition \(\log \Pr(\mathbf{y}\mid\mathbf{x}) = \sum_{i=1}^{n} \log \Pr(y_i\mid\mathbf{x}, \mathbf{y}_{<i})\). The model produces next-token logits at each step, which should be converted to probabilities with a softmax over the full vocabulary.
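For concreteness, here is a minimal sketch of the intended full-vocabulary computation (the `torch` usage and the function name `sequence_log_prob` are illustrative assumptions, not the production code):

```python
import torch

def sequence_log_prob(step_logits: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Sum_i log Pr(y_i | x, y_<i) with a softmax over the full vocabulary.

    step_logits: (n, V) next-token logits, one row per continuation position.
    target_ids:  (n,)   token ids of the candidate continuation y.
    """
    # log_softmax normalizes each row over all V vocabulary entries:
    # log Pr(t | x, y_<i) = u_t - log(sum_{t' in V} exp(u_{t'})).
    log_probs = torch.log_softmax(step_logits, dim=-1)                  # (n, V)
    # Pick out log Pr(y_i | x, y_<i) at each position, then sum over i.
    token_log_probs = log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    return token_log_probs.sum().item()
```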

A teammate implemented the scorer as follows: at each step \(i\), they take the logit for the candidate’s next token, \(u_{y_i}\), and subtract \(\log\big(\sum_{t \in S_i} \exp(u_t)\big)\), where \(S_i\) is a small, request-specific shortlist of 50 tokens (not the full vocabulary). They then sum these per-token values across the continuation to get the sequence score.
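Reconstructing the teammate’s scorer makes the normalization difference explicit (again a sketch; `shortlists`, holding each step’s 50 token ids, is a hypothetical input):

```python
import torch

def shortlist_score(step_logits: torch.Tensor, target_ids: torch.Tensor,
                    shortlists: list[list[int]]) -> float:
    """The teammate's version: normalize over a per-step shortlist S_i, not V."""
    score = 0.0
    for u, y_i, s_i in zip(step_logits, target_ids, shortlists):
        s = torch.tensor(s_i)                        # ids in the shortlist S_i
        # u_{y_i} - log(sum_{t in S_i} exp(u_t)): the normalizer covers only S_i.
        score += (u[y_i] - torch.logsumexp(u[s], dim=0)).item()
    return score
```

Running both helpers on the same logits is enough to reproduce the discrepancy this case study asks you to analyze.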

In an A/B test, the system starts preferring verbose, low-quality continuations that repeat common tokens, even when the model’s raw logits for the “good” continuation look higher at several positions.

As the on-call ML engineer, analyze whether the teammate’s scoring method is mathematically consistent with maximizing \(\log \Pr(\mathbf{y}\mid\mathbf{x})\) under an autoregressive language model. In your answer, explain (1) how softmax normalization affects the conditional next-token probability \(\Pr(y_i\mid\mathbf{x},\mathbf{y}_{<i})\), (2) why using a changing shortlist \(S_i\) can distort the summed log-likelihood comparison between two full sequences, and (3) what concrete change you would make to the scoring computation to correctly rank candidates by \(\log \Pr(\mathbf{y}\mid\mathbf{x})\).
