1Cademy - Applying Log-Likelihood Calculation to a Training Dataset

Learn Before

Log-Likelihood of a Sequence

Formula

Applying Log-Likelihood Calculation to a Training Dataset

The log-likelihood of a sequence $\mathbf{x}$ is computed by summing the log-probabilities of its tokens, each conditioned on the preceding context. Formally, $\mathcal{L}_{\theta}(\mathbf{x}) = \sum_{i=1}^{m} \log \mathrm{Pr}_{\theta}(x_i \mid x_0, \dots, x_{i-1})$, where the subscript $\theta$ on both $\mathcal{L}(\cdot)$ and $\mathrm{Pr}(\cdot)$ denotes the parameters of the language model. The sum starts at $i=1$ because $x_0$ serves only as context and is not itself scored. This per-sequence quantity is the building block for optimizing the model over a training dataset, where it is evaluated for each sequence.
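The formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `sequence_log_likelihood`, the callable `cond_log_prob(token, context)`, and the uniform toy model are all assumptions introduced here to stand in for $\mathrm{Pr}_{\theta}$.

```python
import math

def sequence_log_likelihood(tokens, cond_log_prob):
    """Sum log Pr(x_i | x_0, ..., x_{i-1}) for i = 1..m.

    tokens is [x_0, x_1, ..., x_m]; x_0 is used only as context.
    cond_log_prob(token, context) is a hypothetical stand-in for the model.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        total += cond_log_prob(tokens[i], tokens[:i])
    return total

# Toy model (assumption): every token is equally likely among 4 choices,
# regardless of context.
VOCAB_SIZE = 4

def uniform_log_prob(token, context):
    return -math.log(VOCAB_SIZE)

x = ["<s>", "a", "b", "c"]  # x_0 = <s>, m = 3 scored tokens
ll = sequence_log_likelihood(x, uniform_log_prob)
print(ll)  # equals -3 * log(4) for this toy model
```

Under the uniform toy model every scored token contributes $-\log 4$, so the result depends only on the number of scored tokens $m$.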

Updated 2026-04-19

References

Reference of Foundations of Large Language Models Course

Learn After

Maximum Likelihood Training Objective for a Dataset of Sequences
A language model is defined by the following table of conditional log-probabilities, where <s> is the start-of-sequence token and <eos> is the end-of-sequence token:

| Log-Probability | Value |
|---|---|
| log Pr(A \| <s>) | -0.5 |
| log Pr(B \| <s>) | -1.5 |
| log Pr(B \| A) | -0.2 |
| log Pr(A \| B) | -1.0 |
| log Pr(<eos> \| A) | -2.0 |
| log Pr(<eos> \| B) | -0.1 |

Given a training dataset D containing two sequences:
- Sequence 1: (A, B, <eos>)
- Sequence 2: (B, A, <eos>)
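The question text is cut off here, so the following is only a sketch of how the per-sequence log-likelihoods could be computed from the values given. It assumes each token depends only on the immediately preceding token (as the table suggests) and that `<s>` is the initial context; the names `LOG_PROBS` and `bigram_log_likelihood` are hypothetical. By hand, Sequence 1 gives $-0.5 + (-0.2) + (-0.1) = -0.8$ and Sequence 2 gives $-1.5 + (-1.0) + (-2.0) = -4.5$.

```python
# Conditional log-probabilities from the table, keyed as (token, previous token).
LOG_PROBS = {
    ("A", "<s>"): -0.5,
    ("B", "<s>"): -1.5,
    ("B", "A"): -0.2,
    ("A", "B"): -1.0,
    ("<eos>", "A"): -2.0,
    ("<eos>", "B"): -0.1,
}

def bigram_log_likelihood(sequence, start="<s>"):
    """Sum log Pr(token | previous token), starting from the <s> context."""
    prev = start
    total = 0.0
    for token in sequence:
        total += LOG_PROBS[(token, prev)]
        prev = token
    return total

dataset = [("A", "B", "<eos>"), ("B", "A", "<eos>")]
per_sequence = [bigram_log_likelihood(seq) for seq in dataset]

# Summing over sequences is an assumption about what the question asks for.
dataset_total = sum(per_sequence)

for seq, value in zip(dataset, per_sequence):
    print(seq, round(value, 6))
print("total", round(dataset_total, 6))
```

The values are rounded when printed because floating-point addition of decimals such as 0.7 and 0.1 is not exact.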
Verifying Language Model Performance on a Small Dataset
You are tasked with evaluating a language model's performance on a dataset composed of multiple text sequences. Arrange the following steps in the correct logical order to compute the log-likelihood for each individual sequence in the dataset.
