MLM Training Objective as Maximum Likelihood Estimation
The simplified training objective for Masked Language Modeling, which maximizes the probabilities only of the selected tokens, can be expressed in a maximum likelihood estimation fashion. To optimize the model, we seek parameters that maximize the log-probability of correctly predicting the original tokens at their selected positions. For a dataset D of original sequences x, where x̄ is the modified (masked) sequence and A(x) is the set of selected positions in x̄, the objective function is:

Objective = ∑_{x ∈ D} ∑_{i ∈ A(x)} log Pr(xᵢ | x̄)
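This double summation can be computed directly. Below is a minimal sketch, assuming the model's per-position output distributions are supplied as plain dictionaries (the function name `mlm_objective` and the data layout are illustrative, not from any particular library):

```python
import math

def mlm_objective(batch, predictions):
    """Sum of log-probabilities of the original tokens at the selected positions.

    batch: list of (original_tokens, selected_positions) pairs -- the dataset D,
        where selected_positions plays the role of A(x).
    predictions: for each sequence, a dict mapping a selected position i to the
        model's probability distribution Pr(. | x_bar) over candidate tokens.
    """
    total = 0.0
    for (tokens, selected_positions), probs in zip(batch, predictions):
        # Inner summation: sum over the selected positions i in A(x).
        for i in selected_positions:
            # log Pr(x_i | x_bar): log-probability of the original token.
            total += math.log(probs[i][tokens[i]])
    return total
```

For the sentence "The fox jumps over the dog" with position 2 ("jumps") selected, a model that assigns probability 0.8 to "jumps" contributes log 0.8 ≈ -0.22 to the objective; maximizing the objective pushes that probability toward 1.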

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
MLM Training Objective using Cross-Entropy Loss
MLM Training Objective as Maximum Likelihood Estimation
A language model is being trained using a masked language modeling objective. The input is a sentence where some words have been replaced with a [MASK] token. While the high-level goal is to enable the model to reconstruct the original sentence from this corrupted input, the practical training objective is more specific. Which statement best analyzes the actual, simplified objective the model optimizes during training and the reason for this simplification?

Evaluating an MLM Training Implementation
During the training of a language model with a masked language modeling objective, the model is optimized to predict the entire original text sequence, including the tokens that were not masked, from the corrupted input.
Learn After
MLM Training Objective using Cross-Entropy Loss
In the context of training a language model, the objective is often to find parameters that maximize the likelihood of the training data. Consider the following mathematical expression for this objective:
Objective = ∑_{x ∈ D} ∑_{i ∈ A(x)} log Pr(xᵢ | x̄)

Here, D is the dataset, x is an original text sequence, x̄ is a version of x with some tokens masked, A(x) is the set of indices that were masked in x, and xᵢ is the original token at a masked position i. What does the inner summation, ∑_{i ∈ A(x)} log Pr(xᵢ | x̄), represent in this training process?

Calculating Contribution to MLM Training Objective
A language model is being trained with the objective of maximizing the log-probability of the original tokens at masked positions. For the original sentence 'The fox jumps over the dog', the model is given the masked input 'The fox [MASK] over the dog'. Which of the following model predictions for the [MASK] token would contribute the most to achieving the training objective for this specific instance?

Example of Masked Language Modeling Loss Calculation