Concept

Training Decoder-Only Language Models with Cross-Entropy Loss

Training a decoder-only language model, written $\mathrm{Decoder}_{\theta}(\cdot)$, means optimizing its parameters $\theta$ by minimizing a loss function over a collection of token sequences. At each token position $i$, a loss function $\mathcal{L}(\mathbf{p}_{i+1}^{\theta}, \mathbf{p}_{i+1}^{\mathrm{gold}})$ quantifies the discrepancy between the model's predicted probability distribution over the next token, $\mathbf{p}_{i+1}^{\theta}$, and the gold-standard distribution $\mathbf{p}_{i+1}^{\mathrm{gold}}$, which is typically a one-hot vector placing all probability mass on the token actually observed at position $i+1$. In natural language processing, the standard choice for this loss is cross-entropy computed on log-scale probabilities.
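Since the section names the loss but does not write it out: the cross-entropy between the two distributions is $\mathcal{L}(\mathbf{p}_{i+1}^{\theta}, \mathbf{p}_{i+1}^{\mathrm{gold}}) = -\sum_{v} p_{i+1}^{\mathrm{gold}}[v] \log p_{i+1}^{\theta}[v]$, and when the gold distribution is one-hot it reduces to the negative log-probability of the observed token. The NumPy sketch below illustrates this reduction; the function name, argument shapes, and toy data are our own assumptions for illustration, not taken from the source.

```python
import numpy as np

def cross_entropy_loss(logits, target_ids):
    """Mean next-token cross-entropy over a sequence.

    logits:     shape (seq_len, vocab_size); logits[i] scores the
                model's prediction for the token at position i+1.
    target_ids: shape (seq_len,); the observed next tokens, i.e. the
                one-hot gold distributions in index form.
    """
    # Numerically stable log-softmax: log p_theta over the vocabulary.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # With a one-hot gold distribution, the cross-entropy at each position
    # is just the negative log-probability of the observed token.
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return nll.mean()

# Toy usage: vocabulary of 5 tokens, sequence of 3 predictions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([2, 0, 4])
print(cross_entropy_loss(logits, targets))
```

Framework implementations of this loss (for example, torch.nn.functional.cross_entropy in PyTorch) fuse the log-softmax and the gather step in the same way for numerical stability.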

Updated 2026-04-15

Tags

Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models