MLM Loss Function as Negative Log-Likelihood
The loss function for Masked Language Modeling (MLM) is the negative log-likelihood of correctly predicting the original tokens at their masked positions. Given a token sequence $\mathbf{x} = x_1 \dots x_m$ with a set of selected (masked) positions $A$, and its modified version $\bar{\mathbf{x}}$, the MLM loss is formulated as:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in A} \log \Pr(x_i \mid \bar{\mathbf{x}})$$

In this equation, $\Pr(x_i \mid \bar{\mathbf{x}})$ represents the probability of accurately predicting the original token $x_i$ at the position $i$ given the modified input sequence $\bar{\mathbf{x}}$.
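The sum above can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the `mlm_loss` helper, the positions, and the probability values are all hypothetical, standing in for a model's predicted distributions at each masked position.

```python
import math

def mlm_loss(predicted, original, masked_positions):
    """Sum of negative log-probabilities of the original tokens at the
    masked positions (hypothetical helper, for illustration only)."""
    return -sum(math.log(predicted[i][original[i]]) for i in masked_positions)

# Two masked positions; the probabilities are made-up illustrative values.
predicted = {
    2: {"sat": 0.8, "ran": 0.2},
    5: {"mat": 0.5, "rug": 0.5},
}
original = {2: "sat", 5: "mat"}

loss = mlm_loss(predicted, original, masked_positions=[2, 5])
# loss = -(log 0.8 + log 0.5) ≈ 0.916
```

Note that the loss sums over every masked position, so each additional mask contributes its own negative log-probability term.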

References
Reference of Foundations of Large Language Models Course
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Probability of a True Token in MLM
Predicted Probability Distribution in MLM
Example of MLM Training Objective with Multiple Masks
MLM Loss Function as Negative Log-Likelihood
A language model is being trained to fill in a masked word. For the input 'The cat sat on the [MASK]', the correct word is 'mat'. The training objective is to adjust the model to minimize the cross-entropy loss for its predictions. Below are four different potential outputs from the model, showing the probability it assigns to the word 'mat'. Which of these outputs would result in the LOWEST loss for this specific training example?
Evaluating Model Performance via Cross-Entropy Loss
According to the standard Masked Language Modeling (MLM) training objective, a model's parameters are adjusted based on the cross-entropy loss calculated for a single, strategically chosen masked token within a training batch, aiming to optimize performance on that specific prediction.
MLM Loss Function as Negative Log-Likelihood
A neural network is trained on a 4-class classification task. For a single training example where the true class is the second class, the model outputs the probability vector
[0.1, 0.7, 0.1, 0.1]. The loss for this example is calculated as -log(0.7). This loss function can be interpreted as a measure of divergence between two probability distributions. What are these two distributions?
Interpreting Negative Log-Likelihood as Cross-Entropy
A neural network is being trained for a 3-class classification task (Classes A, B, C). For a single training example, the true label is 'Class B'. The model outputs the probability distribution
P(A)=0.2, P(B)=0.5, P(C)=0.3. The loss for this example is calculated using the negative log-likelihood of the correct class, resulting in a loss of -log(0.5). This calculation is a direct application of the cross-entropy formula between the model's predicted distribution and the empirical distribution from the training data. What is the specific empirical probability distribution for this single training example?
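The equivalence the question relies on can be checked numerically. In this sketch (using the hypothetical values from the question), the empirical distribution is one-hot on the true class, so the cross-entropy collapses to the negative log-likelihood of that class.

```python
import math

# Hypothetical 3-class example: true label is B.
predicted = {"A": 0.2, "B": 0.5, "C": 0.3}
empirical = {"A": 0.0, "B": 1.0, "C": 0.0}  # one-hot on the true class

# Cross-entropy H(empirical, predicted) = -sum_c q(c) * log p(c);
# terms with q(c) = 0 contribute nothing to the sum.
cross_entropy = -sum(q * math.log(predicted[c]) for c, q in empirical.items() if q > 0)
nll = -math.log(predicted["B"])
# Against a one-hot empirical distribution, cross-entropy equals the NLL.
```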
Learn After
A language model is given an input sequence where one token has been replaced by a [MASK] token. The original, correct token for that position was 'fox'. After processing the input, the model outputs the following probability distribution for the masked position:
- P('fox') = 0.7
- P('cat') = 0.2
- P('dog') = 0.1
If the training objective for this single token is to minimize the negative natural logarithm of the probability of the correct token, what is the calculated loss value for this instance? (Use ln for natural logarithm)
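The computation the question asks for is a single negative natural logarithm. A one-line sketch, using the probability of the correct token 'fox' from the distribution above:

```python
import math

# The correct token 'fox' is assigned probability 0.7 by the model.
loss = -math.log(0.7)
# ≈ 0.357
```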
Two language models, Model A and Model B, are tasked with predicting a masked token in a sentence. The correct, original token is 'river'.
Model A's predicted probabilities for the masked position include:
- P('river') = 0.3
- P('stream') = 0.4
- P('water') = 0.2
Model B's predicted probabilities for the masked position include:
- P('river') = 0.01
- P('mountain') = 0.95
- P('sky') = 0.02
Based on the standard negative log-likelihood loss function used for this task, which statement accurately compares the calculated loss for this single prediction?
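The comparison reduces to two negative log terms. A short sketch, using the probabilities each model assigns to the correct token 'river' above:

```python
import math

# Per-prediction negative log-likelihood for each model.
loss_a = -math.log(0.3)   # Model A: ≈ 1.204
loss_b = -math.log(0.01)  # Model B: ≈ 4.605
# Model A's loss is much lower because it assigns 'river' a higher
# probability, even though 'river' is not its top-ranked token.
```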
Calculating Total MLM Loss for a Sequence
Running Example of Computing MLM Loss