MLM Training Objective as Maximum Likelihood Estimation
The simplified training objective for Masked Language Modeling, which maximizes the probabilities only of the selected tokens, can be expressed in a maximum likelihood estimation fashion. To optimize the model, we seek parameters that maximize the log-probability of correctly predicting the original tokens at their selected positions. For a dataset D of original sequences x, where x̄ is the modified (masked) sequence and A(x) is the set of selected positions in x̄, the objective function is:

Objective = ∑_{x ∈ D} ∑_{i ∈ A(x)} log Pr(xᵢ | x̄)
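This double summation can be computed directly. Below is a minimal sketch, assuming the model's per-position output distributions are supplied as plain dictionaries (the function name `mlm_objective` and the data layout are illustrative, not from any particular library):

```python
import math

def mlm_objective(batch, predictions):
    """Sum of log-probabilities of the original tokens at the selected positions.

    batch: list of (original_tokens, selected_positions) pairs -- the dataset D,
        where selected_positions plays the role of A(x).
    predictions: for each sequence, a dict mapping a selected position i to the
        model's probability distribution Pr(. | x_bar) over candidate tokens.
    """
    total = 0.0
    for (tokens, selected_positions), probs in zip(batch, predictions):
        # Inner summation: sum over the selected positions i in A(x).
        for i in selected_positions:
            # log Pr(x_i | x_bar): log-probability of the original token.
            total += math.log(probs[i][tokens[i]])
    return total
```

For the sentence "The fox jumps over the dog" with position 2 ("jumps") selected, a model that assigns probability 0.8 to "jumps" contributes log 0.8 ≈ -0.22 to the objective; maximizing the objective pushes that probability toward 1.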

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
MLM Training Objective using Cross-Entropy Loss
MLM Training Objective as Maximum Likelihood Estimation
A language model is being trained using a masked language modeling objective. The input is a sentence where some words have been replaced with a [MASK] token. While the high-level goal is to enable the model to reconstruct the original sentence from this corrupted input, the practical training objective is more specific. Which statement best analyzes the actual, simplified objective the model optimizes during training and the reason for this simplification?

Evaluating an MLM Training Implementation
During the training of a language model with a masked language modeling objective, the model is optimized to predict the entire original text sequence, including the tokens that were not masked, from the corrupted input.
Learn After
MLM Training Objective using Cross-Entropy Loss
In the context of training a language model, the objective is often to find parameters that maximize the likelihood of the training data. Consider the following mathematical expression for this objective:
Objective = ∑_{x ∈ D} ∑_{i ∈ A(x)} log Pr(xᵢ | x̄)

Here, D is the dataset, x is an original text sequence, x̄ is a version of x with some tokens masked, A(x) is the set of indices that were masked in x, and xᵢ is the original token at a masked position i. What does the inner summation, ∑_{i ∈ A(x)} log Pr(xᵢ | x̄), represent in this training process?

Calculating Contribution to MLM Training Objective
A language model is being trained with the objective of maximizing the log-probability of the original tokens at masked positions. For the original sentence 'The fox jumps over the dog', the model is given the masked input 'The fox [MASK] over the dog'. Which of the following model predictions for the [MASK] token would contribute the most to achieving the training objective for this specific instance?

Example of Masked Language Modeling Loss Calculation