Learn Before
Probability of a True Token in MLM
In Masked Language Modeling (MLM), the expression $\Pr(x_i \mid \tilde{\mathbf{x}})$ denotes the probability of predicting the correct, true token $x_i$ at a specific position $i$, given the corrupted input sequence $\tilde{\mathbf{x}}$. This conditional probability depends on the model's learned parameters, represented by the weights $\mathbf{W}$ and $\mathbf{b}$.
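To make this concrete, here is a minimal sketch (toy vocabulary, random hidden state, and assumed weight shapes, none taken from the card) of how the probability of the true token at a masked position can be obtained from a softmax over output weights, and how its negative log gives the per-token loss:

```python
import numpy as np

# Minimal sketch with assumed names and toy sizes, not the card's model.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary
true_token = "mat"                                  # original token behind [MASK]

rng = np.random.default_rng(0)
h_i = rng.standard_normal(8)                 # hidden state at the masked position i
W = rng.standard_normal((len(vocab), 8))     # output weight matrix (assumed shape)
b = rng.standard_normal(len(vocab))          # output bias

logits = W @ h_i + b                         # one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over the vocabulary

p_true = probs[vocab.index(true_token)]      # P(true token | corrupted input)
loss = -np.log(p_true)                       # per-token cross-entropy / negative log-likelihood
print(f"P(true token) = {p_true:.3f}, loss = {loss:.3f}")
```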

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Probability of a True Token in MLM
Predicted Probability Distribution in MLM
Example of MLM Training Objective with Multiple Masks
MLM Loss Function as Negative Log-Likelihood
A language model is being trained to fill in a masked word. For the input 'The cat sat on the [MASK]', the correct word is 'mat'. The training objective is to adjust the model to minimize the cross-entropy loss for its predictions. Below are four different potential outputs from the model, showing the probability it assigns to the word 'mat'. Which of these outputs would result in the LOWEST loss for this specific training example?
Evaluating Model Performance via Cross-Entropy Loss
According to the standard Masked Language Modeling (MLM) training objective, a model's parameters are adjusted based on the cross-entropy loss calculated for a single, strategically chosen masked token within a training batch, aiming to optimize performance on that specific prediction.
Learn After
A masked language model is given the input sequence: 'The quick brown [MASK] jumps over the lazy dog.' The original, unmasked token at the [MASK] position was 'fox'. Two different versions of the model, Model A and Model B, are used to predict the masked token.
- Model A assigns a probability of 0.85 to the token 'fox'.
- Model B assigns a probability of 0.15 to the token 'fox', and its highest predicted probability is 0.40 for the token 'cat'.
Based on the probability assigned to the correct, original token, which of the following statements provides the most accurate analysis of the models' performance on this specific example?
Analyzing Model Learning via Token Probability
A language model is being trained on the task of filling in masked words. At an early stage of training, for the sentence 'The sun rises in the [MASK]', the model assigns a probability of 0.05 to the correct word 'east'. After many more rounds of successful training on a large dataset, the model is presented with the same masked sentence. Which of the following outcomes is the most plausible and directly reflects the objective of this training process?
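As a toy illustration of that training objective (hypothetical logits and a single gradient-descent step, not the card's data), minimizing the per-token cross-entropy loss directly increases the probability assigned to the correct token:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical logits over a tiny vocabulary; index 3 is the correct token.
logits = np.array([2.0, 1.0, 0.5, 0.0])
correct = 3

p_before = softmax(logits)[correct]
grad = softmax(logits)          # gradient of -log p_correct w.r.t. logits is (softmax - one-hot)
grad[correct] -= 1.0
logits = logits - 1.0 * grad    # one gradient-descent step (learning rate 1.0)
p_after = softmax(logits)[correct]

print(f"p(correct) before: {p_before:.3f}  after one step: {p_after:.3f}")
```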