MLM Loss Function as Negative Log-Likelihood
The loss function for Masked Language Modeling (MLM) is the negative log-likelihood of correctly predicting the original tokens at their masked positions. Given a token sequence $\mathbf{x} = x_1 \dots x_m$ with a set of selected (masked) positions $A$, and its modified version $\bar{\mathbf{x}}$, the MLM loss is formulated as:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in A} \log \Pr(x_i \mid \bar{\mathbf{x}})$$

In this equation, $\Pr(x_i \mid \bar{\mathbf{x}})$ represents the probability of accurately predicting the original token $x_i$ at the position $i$ given the modified input sequence $\bar{\mathbf{x}}$.
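The sum above can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the `mlm_loss` helper, the positions, and the probability values are all hypothetical, standing in for a model's predicted distributions at each masked position.

```python
import math

def mlm_loss(predicted, original, masked_positions):
    """Sum of negative log-probabilities of the original tokens at the
    masked positions (hypothetical helper, for illustration only)."""
    return -sum(math.log(predicted[i][original[i]]) for i in masked_positions)

# Two masked positions; the probabilities are made-up illustrative values.
predicted = {
    2: {"sat": 0.8, "ran": 0.2},
    5: {"mat": 0.5, "rug": 0.5},
}
original = {2: "sat", 5: "mat"}

loss = mlm_loss(predicted, original, masked_positions=[2, 5])
# loss = -(log 0.8 + log 0.5) ≈ 0.916
```

Note that the loss sums over every masked position, so each additional mask contributes its own negative log-probability term.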

References
Reference of Foundations of Large Language Models Course
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Probability of a True Token in MLM
Predicted Probability Distribution in MLM
Example of MLM Training Objective with Multiple Masks
MLM Loss Function as Negative Log-Likelihood
A language model is being trained to fill in a masked word. For the input 'The cat sat on the [MASK]', the correct word is 'mat'. The training objective is to adjust the model to minimize the cross-entropy loss for its predictions. Below are four different potential outputs from the model, showing the probability it assigns to the word 'mat'. Which of these outputs would result in the LOWEST loss for this specific training example?
Evaluating Model Performance via Cross-Entropy Loss
According to the standard Masked Language Modeling (MLM) training objective, a model's parameters are adjusted based on the cross-entropy loss calculated for a single, strategically chosen masked token within a training batch, aiming to optimize performance on that specific prediction.
MLM Loss Function as Negative Log-Likelihood
A neural network is trained on a 4-class classification task. For a single training example where the true class is the second class, the model outputs the probability vector
[0.1, 0.7, 0.1, 0.1]. The loss for this example is calculated as -log(0.7). This loss function can be interpreted as a measure of divergence between two probability distributions. What are these two distributions?
Interpreting Negative Log-Likelihood as Cross-Entropy
A neural network is being trained for a 3-class classification task (Classes A, B, C). For a single training example, the true label is 'Class B'. The model outputs the probability distribution
P(A)=0.2, P(B)=0.5, P(C)=0.3. The loss for this example is calculated using the negative log-likelihood of the correct class, resulting in a loss of -log(0.5). This calculation is a direct application of the cross-entropy formula between the model's predicted distribution and the empirical distribution from the training data. What is the specific empirical probability distribution for this single training example?
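The equivalence the question relies on can be checked numerically. In this sketch (using the hypothetical values from the question), the empirical distribution is one-hot on the true class, so the cross-entropy collapses to the negative log-likelihood of that class.

```python
import math

# Hypothetical 3-class example: true label is B.
predicted = {"A": 0.2, "B": 0.5, "C": 0.3}
empirical = {"A": 0.0, "B": 1.0, "C": 0.0}  # one-hot on the true class

# Cross-entropy H(empirical, predicted) = -sum_c q(c) * log p(c);
# terms with q(c) = 0 contribute nothing to the sum.
cross_entropy = -sum(q * math.log(predicted[c]) for c, q in empirical.items() if q > 0)
nll = -math.log(predicted["B"])
# Against a one-hot empirical distribution, cross-entropy equals the NLL.
```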
Learn After
A language model is given an input sequence where one token has been replaced by a [MASK] token. The original, correct token for that position was 'fox'. After processing the input, the model outputs the following probability distribution for the masked position:
- P('fox') = 0.7
- P('cat') = 0.2
- P('dog') = 0.1
If the training objective for this single token is to minimize the negative natural logarithm of the probability of the correct token, what is the calculated loss value for this instance? (Use ln for natural logarithm)
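The computation the question asks for is a single negative natural logarithm. A one-line sketch, using the probability of the correct token 'fox' from the distribution above:

```python
import math

# The correct token 'fox' is assigned probability 0.7 by the model.
loss = -math.log(0.7)
# ≈ 0.357
```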
Two language models, Model A and Model B, are tasked with predicting a masked token in a sentence. The correct, original token is 'river'.
Model A's predicted probabilities for the masked position include:
- P('river') = 0.3
- P('stream') = 0.4
- P('water') = 0.2
Model B's predicted probabilities for the masked position include:
- P('river') = 0.01
- P('mountain') = 0.95
- P('sky') = 0.02
Based on the standard negative log-likelihood loss function used for this task, which statement accurately compares the calculated loss for this single prediction?
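The comparison reduces to two negative log terms. A short sketch, using the probabilities each model assigns to the correct token 'river' above:

```python
import math

# Per-prediction negative log-likelihood for each model.
loss_a = -math.log(0.3)   # Model A: ≈ 1.204
loss_b = -math.log(0.01)  # Model B: ≈ 4.605
# Model A's loss is much lower because it assigns 'river' a higher
# probability, even though 'river' is not its top-ranked token.
```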
Calculating Total MLM Loss for a Sequence
Running Example of Computing MLM Loss