Loss Function for Language Modeling

To train a language model, such as a decoder-only architecture, the standard approach is to minimize a loss function over a collection of token sequences. This function, denoted as $\mathcal{L}(\mathbf{p}_{i+1}^{\theta}, \mathbf{p}_{i+1}^{\mathrm{gold}})$, measures the discrepancy between the model's predicted probability distribution $\mathbf{p}_{i+1}^{\theta}$ and the true, gold-standard distribution $\mathbf{p}_{i+1}^{\mathrm{gold}}$ at each position. In natural language processing, this difference is typically quantified using the log-scale cross-entropy loss.
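Concretely, writing $V$ for the vocabulary and $\mathbf{p}[v]$ for the probability a distribution assigns to token $v$ (notation introduced here for illustration), the cross-entropy loss at position $i+1$ can be written as

$$\mathcal{L}(\mathbf{p}_{i+1}^{\theta}, \mathbf{p}_{i+1}^{\mathrm{gold}}) = -\sum_{v \in V} \mathbf{p}_{i+1}^{\mathrm{gold}}[v] \log \mathbf{p}_{i+1}^{\theta}[v]$$

In standard next-token training, the gold distribution is a one-hot vector on the observed next token, so this term reduces to the negative log-probability the model assigns to that token; the overall training loss sums (or averages) these terms over all positions and sequences. The short Python sketch below, using made-up numbers and not drawn from the course materials, illustrates the computation for a single position:

```python
import numpy as np

vocab_size = 5

# Hypothetical model prediction: a probability distribution over the vocabulary.
p_theta = np.array([0.10, 0.20, 0.05, 0.60, 0.05])

# Gold distribution: one-hot on the observed next token (index 3 here).
p_gold = np.zeros(vocab_size)
p_gold[3] = 1.0

# Cross-entropy: -sum_v p_gold[v] * log p_theta[v]
loss = -np.sum(p_gold * np.log(p_theta))

# With a one-hot gold distribution, this equals -log p_theta[3].
assert np.isclose(loss, -np.log(p_theta[3]))
print(loss)  # approximately 0.51
```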
