Concept

Training Decoder-Only Language Models with Cross-Entropy Loss

Training a decoder-only language model, written $\mathrm{Decoder}_{\theta}(\cdot)$, means optimizing its parameters $\theta$ by minimizing a loss function over a collection of token sequences. At each token position $i$, a loss function $\mathcal{L}(\mathbf{p}_{i+1}^{\theta}, \mathbf{p}_{i+1}^{\mathrm{gold}})$ quantifies the discrepancy between the model's predicted probability distribution over the next token, $\mathbf{p}_{i+1}^{\theta}$, and the gold-standard distribution $\mathbf{p}_{i+1}^{\mathrm{gold}}$, which is typically a one-hot vector placing all probability mass on the token actually observed at position $i+1$. In natural language processing, the standard choice for this loss is cross-entropy computed on log-scale probabilities.
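Since the section names the loss but does not write it out: the cross-entropy between the two distributions is $\mathcal{L}(\mathbf{p}_{i+1}^{\theta}, \mathbf{p}_{i+1}^{\mathrm{gold}}) = -\sum_{v} p_{i+1}^{\mathrm{gold}}[v] \log p_{i+1}^{\theta}[v]$, and when the gold distribution is one-hot it reduces to the negative log-probability of the observed token. The NumPy sketch below illustrates this reduction; the function name, argument shapes, and toy data are our own assumptions for illustration, not taken from the source.

```python
import numpy as np

def cross_entropy_loss(logits, target_ids):
    """Mean next-token cross-entropy over a sequence.

    logits:     shape (seq_len, vocab_size); logits[i] scores the
                model's prediction for the token at position i+1.
    target_ids: shape (seq_len,); the observed next tokens, i.e. the
                one-hot gold distributions in index form.
    """
    # Numerically stable log-softmax: log p_theta over the vocabulary.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # With a one-hot gold distribution, the cross-entropy at each position
    # is just the negative log-probability of the observed token.
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return nll.mean()

# Toy usage: vocabulary of 5 tokens, sequence of 3 predictions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([2, 0, 4])
print(cross_entropy_loss(logits, targets))
```

Framework implementations of this loss (for example, torch.nn.functional.cross_entropy in PyTorch) fuse the log-softmax and the gather step in the same way for numerical stability.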

Updated 2026-04-15

Tags

Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models