Formula

MLM Training Objective as Maximum Likelihood Estimation

The simplified training objective for Masked Language Modeling, which maximizes the probabilities of only the selected tokens, can be expressed as maximum likelihood estimation. To optimize the model, we seek parameters $(\widehat{\mathbf{W}}, \hat{\theta})$ that maximize the log-probability of correctly predicting the original tokens at their selected positions. For a dataset $\mathcal{D}$ of original sequences $\mathbf{x}$, where $\mathcal{A}(\mathbf{x})$ denotes the set of selected positions in the modified sequence $\bar{\mathbf{x}}$, the objective function is:

$$(\widehat{\mathbf{W}}, \hat{\theta}) = \arg\max_{\mathbf{W}, \theta} \sum_{\mathbf{x} \in \mathcal{D}} \sum_{i \in \mathcal{A}(\mathbf{x})} \log \mathrm{Pr}_{i}^{\mathbf{W},\theta}(x_{i} / \bar{\mathbf{x}})$$
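The inner two sums of this objective can be computed directly from a model's output distributions. A minimal sketch, assuming the model's forward pass has already produced a matrix of unnormalized scores (logits) for the modified sequence; the function name and array shapes are illustrative, not from the source:

```python
import numpy as np

def mlm_log_likelihood(logits, targets, selected):
    """Sum of log-probabilities of the original tokens at the selected
    (masked) positions, i.e. sum over i in A(x) of log Pr_i(x_i / x-bar).

    logits:   [seq_len, vocab_size] unnormalized scores for each position
    targets:  original token ids x_i, one per position
    selected: the set A(x) of selected (masked) position indices
    """
    # Log-softmax over the vocabulary at every position
    # (subtracting the max first for numerical stability).
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Only the selected positions contribute to the objective.
    return sum(log_probs[i, targets[i]] for i in selected)
```

Maximizing this quantity over $(\mathbf{W}, \theta)$ is equivalent to minimizing its negation, which is the usual cross-entropy loss restricted to the masked positions.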


Updated 2026-05-02


Tags

Ch.1 Pre-training - Foundations of Large Language Models
