Log-Likelihood Objective for Language Model Training
The log-likelihood of a sequence is computed by summing the log-probabilities of predicting each token given its predecessors. This decomposition follows from the chain rule of probability. The full expression is:

\log P(x_0, x_1, \ldots, x_m) = \log P(x_0) + \sum_{i=1}^{m} \log P(x_i \mid x_0, \ldots, x_{i-1})

For practical training purposes, the probability of the initial token, P(x_0), is often taken to be 1 (making its log-probability 0), especially when x_0 is a fixed start-of-sequence symbol. This simplifies the objective to the sum of the conditional log-probabilities of the remaining tokens:

\log P(x_0, x_1, \ldots, x_m) = \sum_{i=1}^{m} \log P(x_i \mid x_0, \ldots, x_{i-1})

In short, the process computes the token-prediction log-probability at each position in the sequence and adds these values together.
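As a concrete illustration, here is a minimal sketch of this computation, assuming a hypothetical model interface next_token_log_probs(prefix) that returns a mapping from candidate next tokens to their log-probabilities; the function name and the toy probability table are illustrative, not from the course:

```python
import math

def sequence_log_likelihood(tokens, next_token_log_probs):
    """Sum the conditional log-probabilities log P(x_i | x_0..x_{i-1}).

    tokens[0] is assumed to be a fixed start-of-sequence symbol with
    probability 1, so it contributes 0 and the loop starts at i = 1.
    next_token_log_probs(prefix) is a hypothetical model interface that
    returns {token: log-probability} for the next position.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        log_probs = next_token_log_probs(tokens[:i])
        total += log_probs[tokens[i]]
    return total

def toy_model(prefix):
    # Toy stand-in model: a fixed table of conditional probabilities.
    table = {
        ("<BOS>",): {"The": math.log(0.2)},
        ("<BOS>", "The"): {"cat": math.log(0.4)},
        ("<BOS>", "The", "cat"): {"sat": math.log(0.3)},
    }
    return table[tuple(prefix)]

print(sequence_log_likelihood(["<BOS>", "The", "cat", "sat"], toy_model))
# log(0.2) + log(0.4) + log(0.3) ≈ -3.73
```

Starting the loop at i = 1 mirrors the simplified objective above: the start token contributes nothing to the sum.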

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Related
Log-Likelihood Objective for Language Model Training
Formulating the MLE Objective for a Small Dataset
Total Loss Calculation for a Token Sequence
A model is being trained on a dataset containing just two sequences:
seq_1 = (x_0, x_1) and seq_2 = (y_0, y_1, y_2). According to the principle of maximum likelihood estimation for sequential data, which expression correctly represents the decomposed log-probability that the model aims to maximize for this entire dataset?
When training a model on a sequence of data using the Maximum Likelihood Estimation objective, a single prediction with a very low conditional probability for one element in the sequence can have a disproportionately large negative impact on the total log-probability calculated for that entire sequence.
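For orientation, one standard decomposition, assuming x_0 and y_0 are fixed start symbols whose log-probability is 0 (as in the note above):

\log P(\mathcal{D}) = \log P(x_1 \mid x_0) + \log P(y_1 \mid y_0) + \log P(y_2 \mid y_0, y_1)

The second statement also follows from this form: because \log p \to -\infty as p \to 0, a single near-zero conditional probability can dominate the whole sum.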
Pre-trained Language Model Decoder Inference
Log-Probability of a Ranked Sequence
Log-Likelihood Objective for Language Model Training
A language model is generating a sequence of tokens. It has computed the following conditional log-probabilities for a three-token sequence, where each token's probability is dependent on the ones that came before it:
- Log-probability of the first token: -1.8
- Log-probability of the second token, given the first: -2.5
- Log-probability of the third token, given the first two: -1.2
Based on these values, what is the total log-likelihood of this entire three-token sequence?
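The total is the sum of the three conditional terms:

(-1.8) + (-2.5) + (-1.2) = -5.5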
Evaluating Sentence Plausibility
A language model has calculated the total log-likelihood for the sequence of tokens: ["The", "quick", "brown", "fox"]. The calculation involves summing the conditional log-probabilities of each token given the preceding ones. If the third token is changed from "brown" to "lazy", creating the new sequence ["The", "quick", "lazy", "fox"], which set of conditional log-probabilities must be re-calculated to find the new total log-likelihood?
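A sketch of the reasoning, assuming a left-to-right factorization with no separate start symbol: terms for positions before the edit keep their conditioning contexts and are unaffected, while the edited token and every token after it must be rescored:

\text{unchanged: } \log P(\text{The}), \quad \log P(\text{quick} \mid \text{The})
\text{recomputed: } \log P(\text{lazy} \mid \text{The}, \text{quick}), \quad \log P(\text{fox} \mid \text{The}, \text{quick}, \text{lazy})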
Applying Log-Likelihood Calculation to a Training Dataset
Log-Likelihood Objective for Language Model Training
A language model calculates the joint probability of a sequence of tokens
(x_0, x_1, ..., x_m). The first token, x_0, is a special, deterministic start-of-sequence symbol. How does the nature of this specific first token typically affect the overall calculation of the sequence's joint probability?
Calculating Sequence Probability with a Start Token
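A worked identity for this case (a standard convention, not necessarily the course's exact notation): a deterministic start token has probability 1, so its log-probability is 0 and it drops out of the sum:

P(x_0) = 1 \;\Rightarrow\; \log P(x_0) = 0, \qquad \log P(x_0, \ldots, x_m) = \sum_{i=1}^{m} \log P(x_i \mid x_0, \ldots, x_{i-1})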
Analyzing a Language Model's Sequence Probability
Learn After
Calculating Sequence Log-Likelihood
A language model is being trained on the sentence 'The cat sat'. The model calculates the following conditional log-probabilities at each step, where '<BOS>' is a fixed start-of-sequence token:
- log P('The' | '<BOS>') = -1.5
- log P('cat' | '<BOS>', 'The') = -0.9
- log P('sat' | '<BOS>', 'The', 'cat') = -1.2
Based on the standard training objective for this single sequence, what is the total log-likelihood value that the model aims to maximize?
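Summing the three conditional log-probabilities gives the value the model aims to maximize for this sequence:

(-1.5) + (-0.9) + (-1.2) = -3.6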
Model Output Evaluation
You’re reviewing an internal evaluation script tha...
Your team is building an internal tool that ranks ...
You’re reviewing an internal LLM evaluation pipeli...
Reconciling Training Log-Likelihood with Inference-Time Sequence Selection
Explaining a Counterintuitive Decoding Outcome Using Softmax, Next-Token Conditionals, and Sequence Log-Probability
Diagnosing a “High-Confidence Wrong Token” Bug in Autoregressive Scoring
Investigating a Production Scoring Bug: Softmax Normalization vs. Autoregressive Sequence Log-Probability
Design a Correct Sequence-Scoring Function for Autoregressive LLM Outputs
Root-Cause Analysis: Why a “More Likely” Token-by-Token Completion Loses on Total Sequence Score
Auditing a Candidate Completion Using Softmax Next-Token Probabilities and Autoregressive Log-Probability