Formula

Log-Likelihood Objective for Language Model Training

The log-likelihood of a sequence $\mathbf{x} = (x_0, \dots, x_m)$ is computed by summing the log-probabilities of predicting each token given its predecessors. This follows from the chain rule of probability:

$\log \mathrm{Pr}(\mathbf{x}) = \log \mathrm{Pr}(x_0) + \sum_{j=1}^{m} \log \mathrm{Pr}(x_j \mid \mathbf{x}_{<j})$

For practical training purposes, the probability of the initial token, $\mathrm{Pr}(x_0)$, is often assumed to be 1 (making its log-probability 0), especially when it is a fixed start-of-sequence symbol. This simplifies the objective to summing only the conditional log-probabilities for the remaining tokens:

$\mathcal{L}_{\theta}(\mathbf{x}) = \sum_{j=1}^{m} \log \mathrm{Pr}_{\theta}(x_j \mid \mathbf{x}_{<j})$

In short, the process involves calculating the token prediction log-probability at each position in the sequence and then adding these values together.
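The summation above can be sketched in a few lines of Python. This is a minimal illustration, not a training loop: the list of conditional probabilities stands in for whatever a model $\mathrm{Pr}_{\theta}$ would output, and the helper name `sequence_log_likelihood` is chosen here for illustration.

```python
import math

def sequence_log_likelihood(token_probs):
    """Sum the per-position conditional log-probabilities.

    token_probs[j] plays the role of Pr(x_j | x_<j) for positions
    1..m. The start symbol x_0 is treated as fixed (Pr = 1, so its
    log-probability is 0 and contributes nothing to the sum).
    """
    return sum(math.log(p) for p in token_probs)

# Toy example: a hypothetical model assigns these conditional
# probabilities to the three tokens following the start symbol.
probs = [0.5, 0.25, 0.8]
ll = sequence_log_likelihood(probs)
# Summing logs equals the log of the product: 0.5 * 0.25 * 0.8 = 0.1,
# so ll == log(0.1).
```

Note that summing log-probabilities is numerically safer than multiplying raw probabilities, which underflow quickly for long sequences; this is the practical reason the objective is stated in log form.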

Updated 2026-05-02
