Learn Before
Predicted Probability Distribution in MLM
Within the Masked Language Modeling (MLM) framework, the notation $\mathrm{Pr}_i(\cdot \mid \tilde{\mathbf{x}}; \theta, \omega)$ represents the model's predicted probability distribution over the vocabulary for the token at a masked position $i$. This distribution is computed from the corrupted input sequence $\tilde{\mathbf{x}}$ and the model's trainable parameters $\theta$ and $\omega$ (the encoder and output-layer parameters, respectively). During training, this predicted distribution is evaluated against the true token's distribution to compute the cross-entropy loss, i.e., the negative log-probability assigned to the true token.
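To make the notation concrete, here is a minimal sketch in Python, assuming a toy four-word vocabulary and hypothetical encoder logits (none of these values come from the course material): a softmax over the masked position's scores yields the predicted distribution, and the loss is the negative log-probability of the true token.

```python
import math

# A minimal sketch of the MLM prediction step, using a toy four-word
# vocabulary and made-up logits (all values here are hypothetical).
# The encoder reads the corrupted input and emits one score per
# vocabulary word at the masked position; a softmax converts those
# scores into the predicted distribution.
vocab = ["cat", "sat", "on", "mat"]
logits = [0.5, -1.2, 0.1, 2.3]  # hypothetical encoder outputs

exps = [math.exp(z) for z in logits]
total = sum(exps)
predicted = {w: e / total for w, e in zip(vocab, exps)}

# Cross-entropy against the true token's one-hot distribution reduces
# to the negative log-probability the model assigns to that token.
true_token = "mat"
loss = -math.log(predicted[true_token])

print(predicted)            # probabilities sum to 1
print(f"loss = {loss:.4f}")
```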

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Probability of a True Token in MLM
Predicted Probability Distribution in MLM
Example of MLM Training Objective with Multiple Masks
MLM Loss Function as Negative Log-Likelihood
A language model is being trained to fill in a masked word. For the input 'The cat sat on the [MASK]', the correct word is 'mat'. The training objective is to adjust the model to minimize the cross-entropy loss for its predictions. Below are four different potential outputs from the model, showing the probability it assigns to the word 'mat'. Which of these outputs would result in the LOWEST loss for this specific training example?
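The four answer options are not reproduced here, but a quick numeric sketch (with hypothetical probabilities) shows why the output assigning the highest probability to 'mat' yields the lowest loss:

```python
import math

# Cross-entropy loss for this single masked token is -log p('mat'),
# so the loss shrinks as the assigned probability grows (toy values).
for p in [0.1, 0.3, 0.6, 0.9]:
    print(f"p('mat') = {p:.1f}  ->  loss = {-math.log(p):.3f}")
```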
Evaluating Model Performance via Cross-Entropy Loss
According to the standard Masked Language Modeling (MLM) training objective, a model's parameters are adjusted based on the cross-entropy loss calculated for a single, strategically chosen masked token within a training batch, aiming to optimize performance on that specific prediction.
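For reference, the standard MLM objective aggregates the loss over every masked position in the input rather than a single one; a minimal sketch with hypothetical predicted probabilities for three masked tokens:

```python
import math

# Standard MLM training loss: the negative log-likelihood is summed
# over ALL masked positions in the input, not a single chosen one.
# The probabilities below are hypothetical model outputs for the
# true token at each of three masked positions.
p_true = [0.70, 0.40, 0.55]

loss = -sum(math.log(p) for p in p_true)
print(f"total loss over {len(p_true)} masked tokens = {loss:.4f}")
```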
Learn After
A masked language model processes the input 'The chef carefully seasoned the [MASK] before serving.' For the masked position, the model generates a probability distribution over its entire 30,000-word vocabulary. The word 'soup' is assigned a probability of 0.6, 'dish' is assigned 0.2, and the remaining probability is spread thinly across the other 29,998 words. If the original, unmasked word was 'soup', which of the following statements provides the most accurate analysis of this outcome?
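A short sanity check of the scenario's numbers (taken directly from the question) makes the outcome easier to judge:

```python
import math

# Numbers from the question: p('soup') = 0.6, p('dish') = 0.2, and the
# remaining 0.2 of probability mass spread over the other 29,998 words.
p_soup, p_dish = 0.6, 0.2
remainder = 1.0 - p_soup - p_dish
print(f"mass left for the rest: {remainder:.2f}")
print(f"per remaining word:     {remainder / 29_998:.2e}")

# If 'soup' is the true token, the cross-entropy loss for this
# prediction is -log(0.6), a fairly low (i.e., good) value.
print(f"loss = {-math.log(p_soup):.4f}")
```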
Interpreting a Model's Output Distribution
A language model with a small vocabulary consisting of only four words ('cat', 'sat', 'on', 'mat') is given the input sequence 'the [MASK] sat on the mat'. The model's task is to predict the masked token. Which of the following options represents a valid predicted probability distribution for the masked position?
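As a reminder of what "valid" means here, a tiny check (a hypothetical helper, not from the course) verifies the two defining properties: every probability is non-negative and the values sum to 1 over the whole vocabulary.

```python
def is_valid_distribution(dist, tol=1e-9):
    """True if all probabilities are non-negative and sum to 1."""
    return (all(p >= 0 for p in dist.values())
            and abs(sum(dist.values()) - 1.0) <= tol)

# Toy checks over the four-word vocabulary from the question.
print(is_valid_distribution({"cat": 0.7, "sat": 0.1, "on": 0.1, "mat": 0.1}))  # True
print(is_valid_distribution({"cat": 0.7, "sat": 0.5, "on": 0.1, "mat": 0.1}))  # False (sums to 1.4)
```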