Formula

Total Loss Calculation for a Token Sequence

The total loss for a token sequence $(x_0, \dots, x_m)$ is computed by summing the individual losses over each position from $i = 0$ to $m-1$. At each position $i$, a loss function $\mathcal{L}$ measures the discrepancy between the model's predicted probability distribution for the next token, $\mathbf{p}_{i+1}^{\theta}$, and the ground-truth distribution, $\mathbf{p}_{i+1}^{\mathrm{gold}}$. This is expressed generally as:

$$\mathrm{Loss}_{\theta}(x_0, \dots, x_m) = \sum_{i=0}^{m-1} \mathcal{L}\left(\mathbf{p}_{i+1}^{\theta}, \mathbf{p}_{i+1}^{\mathrm{gold}}\right)$$

In natural language processing, this loss function $\mathcal{L}$ is typically the cross-entropy loss computed in log scale, leading to the specific formula:

$$\mathrm{Loss}_{\theta}(x_0, \dots, x_m) = \sum_{i=0}^{m-1} \mathrm{LogCrossEntropy}\left(\mathbf{p}_{i+1}^{\theta}, \mathbf{p}_{i+1}^{\mathrm{gold}}\right)$$
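The summation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in any particular framework; the function names `log_cross_entropy` and `sequence_loss` are chosen here for clarity, and each distribution is assumed to be a vector over the vocabulary.

```python
import numpy as np

def log_cross_entropy(p_pred, p_gold):
    # Cross-entropy between the gold and predicted distributions,
    # computed in log scale: -sum_k p_gold[k] * log(p_pred[k]).
    return -np.sum(p_gold * np.log(p_pred))

def sequence_loss(pred_dists, gold_dists):
    # Total loss over positions i = 0 .. m-1, where pred_dists[i]
    # plays the role of p^theta_{i+1} and gold_dists[i] of p^gold_{i+1}.
    return sum(log_cross_entropy(p, g)
               for p, g in zip(pred_dists, gold_dists))
```

When each gold distribution is a one-hot vector on the actual next token, each term reduces to $-\log$ of the probability the model assigned to that token, which is the familiar negative log-likelihood objective.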

Updated 2026-05-02
