Learn Before
Loss Function for Language Modeling
To train a language model, such as a decoder-only architecture, the standard approach is to minimize a loss function over a collection of token sequences. This function measures the discrepancy between the model's predicted probability distribution and the true, gold-standard distribution at each position. In natural language processing, this discrepancy is typically quantified using the log-scale cross-entropy loss.
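As a minimal sketch of this objective (assuming natural logarithms, a one-hot gold distribution at each position, and a hypothetical helper name `sequence_cross_entropy`), the per-sequence loss can be computed as the average negative log-probability the model assigns to each gold token:

```python
import math

def sequence_cross_entropy(pred_dists, gold_tokens):
    """Average negative log-likelihood of the gold token at each position.

    pred_dists: list of dicts mapping candidate token -> predicted probability
    gold_tokens: the correct (gold-standard) token at each position
    """
    total = 0.0
    for dist, gold in zip(pred_dists, gold_tokens):
        # Cross-entropy against a one-hot target reduces to -log p(gold token)
        total += -math.log(dist[gold])
    return total / len(gold_tokens)

# Toy two-position sequence (illustrative numbers, not from the text)
preds = [
    {"cat": 0.7, "dog": 0.3},
    {"sat": 0.9, "ran": 0.1},
]
loss = sequence_cross_entropy(preds, ["cat", "sat"])
```

Minimizing this quantity pushes the model to place more probability mass on the correct token at every position.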

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Fundamental LLM Training Objective
LLM Policy as a Probability Distribution
A language model is given the context: 'The chef carefully added the final, crucial ingredient to the simmering stew: a pinch of...'. The model must predict the next word. Below are the conditional probabilities, Pr(next_word | context), calculated by two different models for four possible next words.

Next Word    Model A Probability    Model B Probability
salt         0.65                   0.20
concrete     0.02                   0.45
laughter     0.03                   0.15
thyme        0.30                   0.20

Based on this data, which of the following statements is the most accurate analysis of the models' understanding of the context?
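To connect this question to the loss function above: if a sensible word such as 'salt' were the gold next token, each model's per-token cross-entropy loss would simply be the negative log of the probability it assigned to that word. A small sketch (assuming natural logarithms; the probabilities come from the table):

```python
import math

# Pr(salt | context) for each model, taken from the table above
probs = {"Model A": 0.65, "Model B": 0.20}

# Per-token cross-entropy loss if 'salt' is the gold next word
losses = {model: -math.log(p) for model, p in probs.items()}
```

The model that assigns more probability to the plausible continuation incurs the smaller loss.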
Mathematical Notation for Text Generation Probability
Evaluating Language Model Suitability
Predicting Next-Word Likelihood
Loss Function for Language Modeling
A Broad Definition of Cross Entropy
Why do we want to minimize cross-entropy loss?
Denoising Autoencoder Training Objective
MLM Training Objective using Cross-Entropy Loss
Consider a binary classification task where the correct label for a specific instance is 1. A model makes two different predictions for this instance: Prediction A is 0.9 and Prediction B is 0.6. According to the cross-entropy loss function, which statement accurately compares the loss for these two predictions?
Calculating Cross-Entropy Loss
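The comparison in this question can be checked numerically. A minimal sketch of binary cross-entropy (assuming natural logarithms and a hypothetical helper name `binary_cross_entropy`):

```python
import math

def binary_cross_entropy(y, p):
    """Cross-entropy loss for one binary instance with true label y and predicted p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

loss_a = binary_cross_entropy(1, 0.9)  # more confident in the correct label
loss_b = binary_cross_entropy(1, 0.6)  # less confident in the correct label
```

Because the true label is 1, each loss reduces to -log(p), so the more confident correct prediction receives the smaller loss.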
Analyzing Model Errors with Cross-Entropy Loss
Loss Function for Language Modeling
Learn After
A language model is being trained to predict the next word in a sequence. The training process aims to minimize a loss value, which measures the difference between the model's predicted probability distribution for the next word and the actual correct word. Consider two separate predictions for the next word after the phrase 'The sun is shining...':
- Prediction A: The model assigns a probability of 0.75 to the correct word, 'brightly'.
- Prediction B: The model assigns a probability of 0.15 to the correct word, 'brightly'.
Which of the following statements accurately analyzes the loss values for these two predictions?
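A short sketch of how the two loss values could be computed (assuming natural logarithms and a one-hot target, so the loss is just the negative log of the probability assigned to the correct word 'brightly'):

```python
import math

loss_a = -math.log(0.75)  # Prediction A: p('brightly') = 0.75
loss_b = -math.log(0.15)  # Prediction B: p('brightly') = 0.15
```

The prediction that assigns higher probability to the correct word yields the smaller loss, and because the log is steep near zero, the low-probability prediction is penalized disproportionately.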
Total Loss Calculation for a Token Sequence
Evaluating Model Prediction Quality
Defining the Ground Truth Distribution