Google

To convert raw, unnormalized outputs $$\mathbf{o}$$ into valid probabilities, the softmax function applies an exponential function to each component and then normalizes them by their sum. The exponentiation ensures that all probabilities are non-negative, while the division ensures that they sum to $$1$$. Mathematically, the predicted probability distribution $$\hat{\mathbf{y}}$$ is defined as:

$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad 	extrm{where}\quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}$$

This guarantees that $$0 \le \hat{y}_i \le 1$$ and $$\sum_j \hat{y}_j = 1$$. Unlike other normalizations or the probit model, the softmax function preserves order and leads to a more well-behaved optimization problem.

Softmax Function

Pros:
 - Able to handle multiple classes only one class in other activation functions—normalizes the outputs for each class between 0 and 1, and divides by their sum, giving the probability of the input value being in a specific class.
 - Useful for output neurons—typically Softmax is used only for the output layer, for neural networks that need to classify inputs into multiple categories.

Pros and Cons of Softmax Function

Softmax regression is the generalization of logistic regression to multiple classes. In other words, each data point belongs to one of multiple classes (rather than just two options, as is the case for logistic regression). Hence softmax regression is also called multi-class logistic regression or multinomial logistic regression.

Softmax Regression (Activation)

A parameterized Softmax layer, denoted as $$\mathrm{Softmax}_{\mathbf{W}}(\cdot)$$, incorporates a set of weights, $$\mathbf{W}$$. This layer operates by first applying a linear transformation to the input hidden states, $$\mathbf{H}$$, using the weight matrix $$\mathbf{W}$$, and then passing the result through the standard Softmax function. This operation is formally defined by the equation: $$\mathrm{Softmax}_{\mathbf{W}}(\mathbf{H}) = \mathrm{Softmax}(\mathbf{H} \cdot \mathbf{W})$$.

Parameterized Softmax Layer

In the Plackett-Luce model, the probability of selecting a specific response $$\mathbf{y}$$ from a set of possible responses $$Y$$ given an input $$\mathbf{x}$$, is calculated by normalizing its "worth" value, $$\alpha(\mathbf{y})$$. The selection probability is the worth of the selected response divided by the sum of the worths of all possible responses:
$$ \Pr(\mathbf{y}\text{ is selected}|\mathbf{x},Y) = \frac{\alpha(\mathbf{y})}{\sum_{\mathbf{y}' \in Y} \alpha(\mathbf{y}')} = \frac{\exp(r(\mathbf{x},\mathbf{y}))}{\sum_{\mathbf{y}' \in Y} \exp(r(\mathbf{x},\mathbf{y}'))} $$

Plackett-Luce Selection Probability Formula

In autoregressive models, the conditional probability of the next token $$y_i$$, given an input $$\mathbf{x}$$ and the preceding tokens $$\mathbf{y}_{<i}$$, is often calculated using the softmax function. This is expressed as: $$\overline{\text{Pr}}(y_i|\mathbf{x}, \mathbf{y}_{<i}) = \frac{\exp(u_{y_i})}{\sum_{y_j \in \overline{V}_i} \exp(u_{y_j})}$$ Here, $$u_{y_i}$$ represents the unnormalized score (or logit) for the token $$y_i$$. The probability is obtained by exponentiating this score and normalizing it by the sum of exponentiated scores of all candidate tokens $$y_j$$ within a specific vocabulary subset $$\overline{V}_i$$.

Conditional Probability Formula for Autoregressive Models using Softmax

A neural network's final layer produces the raw output scores (logits) `[2.0, 1.0, 0.1]` for three possible classes. To convert these scores into class probabilities, a function is applied that first exponentiates each score and then normalizes these new values by dividing each by their sum. What is the resulting probability distribution? (Values are rounded to three decimal places).

A function is used to convert a vector of raw, unnormalized scores `z = [z_1, z_2, ..., z_K]` into a probability distribution. This function operates by first applying the standard exponential function to each score and then normalizing these new values by dividing each by their sum. If a constant value `C` is added to every score in the input vector `z`, resulting in a new vector `z' = [z_1+C, z_2+C, ..., z_K+C]`, how will the resulting output probability distribution be affected?

Consider two input vectors of raw scores (logits) for a 3-class classification problem: Vector A = `[1, 2, 3]` and Vector B = `[1, 5, 10]`. Both vectors are passed through a function that exponentiates each score and then normalizes the results by dividing by their sum. How will the resulting probability distribution for Vector B compare to the one for Vector A?

You’re reviewing an internal evaluation script tha...

Your team is building an internal tool that ranks ...

You’re reviewing an internal LLM evaluation pipeli...

You are reviewing an internal incident report: a product team claims their LLM “should have generated” a particular 3-token continuation y = (y1, y2, y3) after a prompt x because, at each step, the model assigned the highest next-token probability to the token that appears in that continuation. Another team counters that the correct inference objective is to choose the continuation that maximizes the conditional probability of the entire sequence given x, and that this can disagree with stepwise top-1 choices.

Write an analysis that (1) states the mathematical inference objective for selecting an output sequence given x, (2) decomposes that objective autoregressively into next-token conditional probabilities, (3) explains how the model obtains each next-token probability from logits using softmax, and (4) connects this to the training objective by explaining how maximizing log-likelihood over data relates to (but does not guarantee) greedy stepwise selection at inference. Your answer should explicitly use log-probabilities (sum of logs) to justify why “highest at each step” is not the same claim as “highest total sequence probability,” and should include at least one concrete numeric mini-example (you may invent numbers) showing how two different 3-token continuations can lead to this disagreement.

Reconciling Training Log-Likelihood with Inference-Time Sequence Selection

You are reviewing an internal demo of an autoregressive LLM used to draft short customer-support replies. For a given prompt x, the model must generate exactly two tokens y1 y2 (then stop). The engineer shows you the model’s final-layer logits (before Softmax) for the next token at step 1, and then (depending on the chosen y1) the logits for step 2:

Step 1 logits over the vocabulary {A, B}: u(A)=0, u(B)=0.

If y1=A, then Step 2 logits over {C, D}: u(C)=10, u(D)=0.
If y1=B, then Step 2 logits over {C, D}: u(C)=1, u(D)=1.

The engineer claims: “Because A and B are equally likely at step 1, greedy decoding is fine; it will pick either A or B, and then we’ll get the best overall two-token completion anyway.”

Write an analysis that (1) computes the relevant next-token probabilities using Softmax at each step, (2) uses the autoregressive decomposition to compute and compare the total conditional log-probability log Pr(y1,y2|x) for the best completion under y1=A versus under y1=B, and (3) explains—using the log-likelihood/sequence-scoring perspective—whether the engineer’s claim is correct and why. Your answer should make clear how per-step conditional probabilities interact to determine the best overall sequence under the argmax objective.

Explaining a Counterintuitive Decoding Outcome Using Softmax, Next-Token Conditionals, and Sequence Log-Probability

You are reviewing an internal evaluation script for an autoregressive LLM used to draft customer-support replies. The script is supposed to (a) compute the total conditional log-probability of a candidate reply y given a prompt x, and (b) explain why the model preferred one reply over another.

A teammate reports a suspicious pattern: for many prompts, the script claims the model assigns extremely high probability to the next token (often >0.99) and therefore very high total sequence probability, even when the chosen reply is clearly worse. You inspect the code and find two implementation choices:

1) At each position i, it takes the model’s logits vector u^(i) over the vocabulary and converts it to “probabilities” by dividing each logit by the sum of logits (i.e., p_k = u_k / sum_j u_j), without exponentiating.
2) To score a full candidate reply y = (y_1,...,y_n), it multiplies the per-token probabilities across positions to get Pr(y|x), and then takes log at the end (i.e., log(∏_i Pr(y_i|x,y_<i))).

Write an analysis explaining, in a way that a software engineer could act on, why these choices can produce misleading next-token probabilities and incorrect sequence comparisons. Your answer must:
- Use the correct mathematical objective for inference-time sequence selection (argmax over log Pr(y|x)) and its autoregressive decomposition.
- Explain the role of softmax in turning logits into a valid conditional next-token distribution Pr(y_i|x,y_<i), and what goes wrong when you “normalize logits” directly.
- Connect the per-token conditional probabilities to the log-likelihood-style sum used for stable sequence scoring, and explain why the sum of log-probabilities is the standard computation.
- Propose a corrected scoring approach (in words and/or formulas) that would let the team reliably compare two candidate replies of different lengths without numerical issues.

Diagnosing a “High-Confidence Wrong Token” Bug in Autoregressive Scoring

You are reviewing a production LLM feature that ranks two candidate continuations for the same user prompt x by computing a score for each full continuation y and choosing the higher-scoring one. The intended scoring rule is to choose $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \log \Pr(\mathbf{y}\mid\mathbf{x})$, using the autoregressive decomposition $\log \Pr(\mathbf{y}\mid\mathbf{x}) = \sum_{i=1}^{n} \log \Pr(y_i\mid\mathbf{x}, \mathbf{y}_{<i})$. The model produces next-token logits at each step, which should be converted to probabilities with softmax over the full vocabulary.

A teammate implemented the scorer as follows: at each step i, they take the logit for the candidate’s next token $u_{y_i}$ and subtract $\log\big(\sum_{t \in S_i} \exp(u_t)\big)$, where $S_i$ is a small, request-specific shortlist of 50 tokens (not the full vocabulary). They then sum these per-token values across the continuation to get the sequence score.

In an A/B test, the system starts preferring verbose, low-quality continuations that repeat common tokens, even when the model’s raw logits for the “good” continuation look higher at several positions.

As the on-call ML engineer, analyze whether the teammate’s scoring method is mathematically consistent with maximizing $\log \Pr(\mathbf{y}\mid\mathbf{x})$ under an autoregressive language model. In your answer, explain (1) how softmax normalization affects the conditional next-token probability $\Pr(y_i\mid\mathbf{x},\mathbf{y}_{<i})$, (2) why using a changing shortlist $S_i$ can distort the summed log-likelihood comparison between two full sequences, and (3) what concrete change you would make to the scoring computation to correctly rank candidates by $\log \Pr(\mathbf{y}\mid\mathbf{x})$.

Investigating a Production Scoring Bug: Softmax Normalization vs. Autoregressive Sequence Log-Probability

You are building an internal evaluation service that ranks multiple candidate completions produced by an autoregressive LLM for the same prompt. The model returns, for each generation step i, a vector of logits u_i over the vocabulary V (one logit per token) computed from the context (prompt x plus previously generated tokens y_<i). Your service must (1) compute the conditional probability of each chosen token y_i from u_i, (2) compute the total conditional log-probability log Pr(y|x) for the full candidate sequence y = (y_1,...,y_n) using the autoregressive decomposition, and (3) return the best candidate y_hat = argmax_y log Pr(y|x). 

Create a precise, implementation-ready specification (math + clear pseudocode) for a function score_and_select(prompt_tokens x, candidates Y, logits_by_candidate U) that returns (best_candidate, scores). Your spec must explicitly show: how Softmax converts logits to next-token probabilities; how you extract Pr(y_i|x,y_<i) for the actually generated token at each step; how you aggregate across steps into a single sequence score consistent with the inference objective; and how this relates to the log-likelihood objective used in training (i.e., what quantity training maximizes that your scorer is computing at inference). Assume <BOS> is a fixed start token with probability 1 and candidates may have different lengths; include how you handle length in the score (e.g., stop at <EOS> if present).

Design a Correct Sequence-Scoring Function for Autoregressive LLM Outputs

You are reviewing an internal evaluation script for an autoregressive LLM used to rank two candidate completions for the same prompt x. The script is supposed to choose the completion y that maximizes the conditional log-probability log Pr(y|x), computed as a sum of next-token log-probabilities. However, the script’s author claims they can compare candidates by summing the *raw logits* (pre-softmax scores) of the chosen tokens at each position, because “softmax is monotonic so it won’t change the ranking.”

In one example, the model produces the following logits over a 3-token vocabulary {A, B, C} at each generation step (higher logit = higher score). Candidate 1 is y^(1) = [A, A]; Candidate 2 is y^(2) = [B, B].

Step 1 logits given x:
- u(A)=10, u(B)=9, u(C)=0

Step 2 logits given x and the first generated token:
- if the first token was A: u(A)=0, u(B)=0, u(C)=0
- if the first token was B: u(A)=8, u(B)=7, u(C)=0

The script’s current scoring method sums the selected-token logits across steps (e.g., score(y)=u(y1)+u(y2) using the appropriate conditional logits at step 2).

As the reviewer, determine which candidate should be selected under the *correct* inference objective, and explain why the “sum of logits” method can produce a different ranking in this case. Your explanation must explicitly connect (1) autoregressive decomposition into next-token conditionals, (2) softmax’s role in turning logits into probabilities, and (3) why training/inference use log-likelihood (log-probability) rather than raw logits for sequence scoring.

Root-Cause Analysis: Why a “More Likely” Token-by-Token Completion Loses on Total Sequence Score

You are reviewing an internal evaluation report for a customer-support LLM. The report claims the model would prefer Completion A over Completion B for the same prompt because “A has higher probability.” You suspect the analyst mixed up logits, probabilities, and sequence scoring.

Using ONLY the information below, determine which completion is actually more likely under the model (i.e., has higher conditional log-probability given the prompt), and briefly explain the reasoning steps you used (including how softmax, next-token conditional probabilities, and autoregressive decomposition combine into a single sequence score).

Auditing a Candidate Completion Using Softmax Next-Token Probabilities and Autoregressive Log-Probability

The derivative of the softmax cross-entropy loss with respect to any unnormalized logit $$o_j$$ reveals an elegant and intuitive result:

$$\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j$$

This derivative is exactly the difference between the conditional probability assigned to the class by the model's softmax operation and the actual observation recorded in the one-hot label vector $$\mathbf{y}$$. This elegant property is characteristic of any exponential family model and makes computing gradients for backpropagation extremely straightforward.

Derivative of Softmax Cross-Entropy Loss with Respect to Logits

When calculating the softmax function, $$\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$$, computational risks arise due to exponential operations. If some input logits, $$o_k$$, are very large positive numbers, computing $$\exp(o_k)$$ can produce values that exceed the maximum limit of certain data types (such as $$10^{38}$$ for single-precision floating-point numbers). This phenomenon is known as numerical overflow, and it leads to mathematical instability because the resulting predicted probabilities become undefined.

Numerical Overflow in Softmax Function

The batched softmax function applies the softmax operation row-wise to map a matrix of raw scalar outputs, denoted as $$\mathbf{X}$$, into a matrix where each row represents a valid probability distribution. It transforms each element into a non-negative number and ensures that each row sums to 1. Computing this row-wise softmax involves three steps: first, exponentiating each element of the matrix; second, computing a normalization constant for each row by summing its exponentiated elements; and third, dividing each element by its corresponding row's normalization constant. The mathematical formula is expressed as: $$\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}$$

Learn Before

Related