Learn Before
Example of MLM Training Objective with Multiple Masks
To illustrate the Masked Language Modeling (MLM) training objective with multiple masked tokens, consider the original sequence "the early bird catches the worm". If the tokens "early" at position 2 and "worm" at position 6 are masked, the objective is to maximize the sum of log-scale probabilities for correctly predicting these two tokens. Given the corrupted input $\bar{x}$ = "the [MASK] bird catches the [MASK]", where the tokens at positions 2 and 6 have been replaced by [MASK], the loss function to maximize is:

$$\log \Pr(x_2 = \text{early} \mid \bar{x}) + \log \Pr(x_6 = \text{worm} \mid \bar{x})$$

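A minimal sketch of how this two-mask objective can be computed, assuming PyTorch and a toy vocabulary (neither appears in the card; the random logits stand in for a real model's output): cross-entropy is evaluated only at the masked positions, and the log-probabilities of the true tokens are summed.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary for "the early bird catches the worm" (an assumption,
# purely for illustration).
vocab = ["the", "early", "bird", "catches", "worm", "[MASK]"]
tok = {w: i for i, w in enumerate(vocab)}

# Stand-in for a model's output: one logit vector per position.
torch.manual_seed(0)
logits = torch.randn(6, len(vocab))  # (sequence length, vocabulary size)

masked_positions = [1, 5]  # 0-based indices of positions 2 and 6
targets = torch.tensor([tok["early"], tok["worm"]])

# Log-probabilities at the masked positions only.
log_probs = F.log_softmax(logits[masked_positions], dim=-1)

# Sum of log Pr(x_2 = early | x̄) and log Pr(x_6 = worm | x̄).
objective = log_probs[torch.arange(2), targets].sum()

# Training maximizes the objective, i.e. minimizes its negation
# (the summed cross-entropy over the masked positions).
loss = -objective
print(loss.item())
```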
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Probability of a True Token in MLM
Predicted Probability Distribution in MLM
Example of MLM Training Objective with Multiple Masks
MLM Loss Function as Negative Log-Likelihood
A language model is being trained to fill in a masked word. For the input 'The cat sat on the [MASK]', the correct word is 'mat'. The training objective is to adjust the model to minimize the cross-entropy loss for its predictions. Below are four different potential outputs from the model, showing the probability it assigns to the word 'mat'. Which of these outputs would result in the LOWEST loss for this specific training example?
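As a sanity check on the reasoning behind this question: for a single correct word, the cross-entropy loss reduces to $-\log p(\text{mat})$, so the output assigning 'mat' the highest probability yields the lowest loss. A tiny sketch with made-up probabilities (the card's four actual options are not reproduced here):

```python
import math

# Hypothetical probabilities a model might assign to the true word "mat".
for p in [0.1, 0.3, 0.6, 0.9]:
    print(f"p(mat) = {p:.1f}  ->  loss = -log p = {-math.log(p):.3f}")
# The highest p(mat) gives the lowest cross-entropy loss.
```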
Evaluating Model Performance via Cross-Entropy Loss
According to the standard Masked Language Modeling (MLM) training objective, a model's parameters are adjusted based on the cross-entropy loss calculated for a single, strategically chosen masked token within a training batch, aiming to optimize performance on that specific prediction.
Learn After
A language model is being trained using a masked language modeling objective. The original input sentence is 'A quick brown fox jumps over the lazy dog'. During a training step, the tokens 'quick' (at position 2) and 'lazy' (at position 8) are masked. The model receives the corrupted input, denoted as $\bar{x}$: '[CLS] A [MASK] brown fox jumps over the [MASK] dog'. Which of the following mathematical expressions correctly represents the training objective for this specific step, which the model aims to maximize?
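Mirroring the worked example at the top of this page (and assuming the same $\bar{x}$ notation for the corrupted input), the objective for this step would take the form:

```latex
% Sum of log-probabilities of the two true tokens at their masked positions
\log \Pr(x_2 = \text{quick} \mid \bar{x}) + \log \Pr(x_8 = \text{lazy} \mid \bar{x})
```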
A language model is being trained on a sentence where two words have been replaced with a special [MASK] token. The training objective is to maximize the sum of the log-probabilities of the original words at these two masked positions. Why is the objective formulated as a sum of log-probabilities rather than, for example, a product of the probabilities?
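One standard part of the answer (a sketch of the motivation, not quoted from the card): taking logs turns a product of probabilities into a sum, which is numerically stable. Multiplying many small probabilities underflows to zero in floating point, while the equivalent sum of logs stays well-behaved.

```python
import math

# Product of many small probabilities underflows to 0.0 in float64.
probs = [1e-4] * 100
product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0 (underflow: the true value is 1e-400)

# The equivalent sum of log-probabilities is perfectly representable.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)   # approximately -921.034
```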
Evaluating Model Performance in MLM Training