Calculating Sequence Log-Likelihood
A language model is being trained with the objective of maximizing the log-likelihood of sequences. For the specific sequence 'The cat sat', the model computes the following conditional log-probabilities for the actual next token at each position (assuming a fixed start-of-sequence token):
log Pr('The' | start_token) = -1.5
log Pr('cat' | start_token, 'The') = -0.8
log Pr('sat' | start_token, 'The', 'cat') = -1.2
Calculate the total log-likelihood for this entire sequence, which represents the value the model aims to maximize for this training example.
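By the chain rule, the log-likelihood of the whole sequence is the sum of the per-token conditional log-probabilities. A minimal sketch in Python (the token/log-probability pairs simply mirror the values given above):

```python
import math

# Per-token conditional log-probabilities from the question above.
log_probs = {
    "The": -1.5,  # log Pr('The' | start_token)
    "cat": -0.8,  # log Pr('cat' | start_token, 'The')
    "sat": -1.2,  # log Pr('sat' | start_token, 'The', 'cat')
}

# Chain rule: the sequence log-likelihood is the sum of the
# conditional log-probabilities of each token given its prefix.
total_log_likelihood = sum(log_probs.values())
print(round(total_log_likelihood, 6))  # -3.5

# Equivalently, the sequence probability is the product of the
# conditional probabilities; exp recovers it from the log domain.
sequence_prob = math.exp(total_log_likelihood)
```

Summing in the log domain is numerically preferable to multiplying raw probabilities, which underflow quickly for long sequences.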
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Calculating Sequence Log-Likelihood
A language model is being trained on the sentence 'The cat sat'. The model calculates the following conditional log-probabilities at each step, where '<BOS>' is a fixed start-of-sequence token:
log P('The' | '<BOS>') = -1.5
log P('cat' | '<BOS>', 'The') = -0.9
log P('sat' | '<BOS>', 'The', 'cat') = -1.2
Based on the standard training objective for this single sequence, what is the total log-likelihood value that the model aims to maximize?
Model Output Evaluation
You're reviewing an internal evaluation script tha...
Your team is building an internal tool that ranks ...
You're reviewing an internal LLM evaluation pipeli...
Reconciling Training Log-Likelihood with Inference-Time Sequence Selection
Explaining a Counterintuitive Decoding Outcome Using Softmax, Next-Token Conditionals, and Sequence Log-Probability
Diagnosing a "High-Confidence Wrong Token" Bug in Autoregressive Scoring
Investigating a Production Scoring Bug: Softmax Normalization vs. Autoregressive Sequence Log-Probability
Design a Correct Sequence-Scoring Function for Autoregressive LLM Outputs
Root-Cause Analysis: Why a "More Likely" Token-by-Token Completion Loses on Total Sequence Score
Auditing a Candidate Completion Using Softmax Next-Token Probabilities and Autoregressive Log-Probability