Formula

MLM Training Objective using Cross-Entropy Loss

The training objective for Masked Language Modeling (MLM) is to find the model parameters $\widehat{\mathbf{W}}$ and $\hat{\theta}$ that minimize the total cross-entropy loss over a given dataset $\mathcal{D}$. For each modified text sequence $\bar{\mathbf{x}}$, the loss is computed only over the set of selected positions $\mathcal{A}(\mathbf{x})$, comparing the model's predicted probability distribution $\mathbf{p}_{i}^{\mathbf{W},\theta}$ with the ground-truth distribution $\mathbf{p}_{i}^{\mathrm{gold}}$ at each selected position $i$. The complete optimization objective is formulated as:

$$(\widehat{\mathbf{W}},\hat{\theta}) = \arg\min_{\mathbf{W},\theta} \sum_{\mathbf{x} \in \mathcal{D}} \sum_{i \in \mathcal{A}(\mathbf{x})} \mathrm{LogCrossEntropy}\left(\mathbf{p}_{i}^{\mathbf{W},\theta}, \mathbf{p}_{i}^{\mathrm{gold}}\right)$$
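To make the objective concrete, here is a minimal PyTorch sketch (the function name `mlm_loss` and all variable names are illustrative assumptions, not from the source). Since $\mathbf{p}_{i}^{\mathrm{gold}}$ is typically a one-hot distribution over the original token, the cross-entropy at position $i$ reduces to the negative log-probability the model assigns to that token.

```python
# Illustrative sketch only; names are assumptions, not from the source.
import torch
import torch.nn.functional as F

def mlm_loss(logits: torch.Tensor, gold_ids: torch.Tensor,
             mask_positions: torch.Tensor) -> torch.Tensor:
    """Cross-entropy summed over the selected positions A(x) only.

    logits:         [batch, seq_len, vocab]  model outputs for p_i^{W,theta} (pre-softmax)
    gold_ids:       [batch, seq_len]         original token ids (one-hot p_i^{gold})
    mask_positions: [batch, seq_len] bool    True at positions in A(x)
    """
    # -log p(gold token) at every position, then keep only the selected ones.
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, gold_ids.unsqueeze(-1)).squeeze(-1)  # [batch, seq_len]
    return (nll * mask_positions).sum()

# Example: batch of 2 sequences of length 8 over a 100-token vocabulary.
logits = torch.randn(2, 8, 100, requires_grad=True)
gold = torch.randint(0, 100, (2, 8))
selected = torch.zeros(2, 8, dtype=torch.bool)
selected[:, [1, 4]] = True  # pretend positions 1 and 4 were masked
loss = mlm_loss(logits, gold, selected)
loss.backward()  # an optimizer step would then update (W, theta)
```

Summing this loss over all sequences in $\mathcal{D}$ and minimizing with respect to the parameters yields $(\widehat{\mathbf{W}},\hat{\theta})$ as in the formula above.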


Tags: Ch.1 Pre-training - Foundations of Large Language Models · Ch.4 Alignment - Foundations of Large Language Models · Foundations of Large Language Models · Foundations of Large Language Models Course · Computing Sciences
