Example of Masked Language Modeling Loss Calculation
To illustrate the objective of maximizing log-scale probabilities in Masked Language Modeling, consider the original sequence 'The early bird catches the worm', where two tokens are masked. The corrupted input is 'The [MASK] bird catches the [MASK]'. The objective is to maximize the sum of the log-probabilities for predicting the true tokens 'early' (x₂) and 'worm' (x₆) given this corrupted input. This is formally expressed as: log Pr(x₂ | x̄) + log Pr(x₆ | x̄), where x̄ denotes the corrupted input.
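A minimal PyTorch sketch of this computation follows. The tiny vocabulary, the position indices, and the random logits standing in for a real encoder's output are all illustrative assumptions, not part of the original example.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary for the example sentence (hypothetical; a real model
# would use a learned tokenizer over tens of thousands of tokens).
vocab = ["the", "early", "bird", "catches", "worm", "[MASK]"]
tok = {w: i for i, w in enumerate(vocab)}

# Corrupted input: "The [MASK] bird catches the [MASK]"
masked_positions = [1, 5]          # indices of the masked tokens (x2 and x6)
true_tokens = ["early", "worm"]    # original tokens at those positions

# Stand-in for model output: logits of shape (seq_len, vocab_size).
# In practice these come from running a Transformer on the corrupted
# input; here they are random purely for illustration.
torch.manual_seed(0)
logits = torch.randn(6, len(vocab))

log_probs = F.log_softmax(logits, dim=-1)  # log Pr(token | x̄) at each position

# Objective for this instance: log Pr(x2 | x̄) + log Pr(x6 | x̄),
# the sum of log-probabilities of the true tokens at masked positions.
objective = sum(log_probs[pos, tok[w]]
                for pos, w in zip(masked_positions, true_tokens))
print(float(objective))
```

Training maximizes this quantity (equivalently, minimizes the cross-entropy loss, its negation) over all masked positions in the corpus.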
Related
MLM Training Objective using Cross-Entropy Loss
In the context of training a language model, the objective is often to find parameters that maximize the likelihood of the training data. Consider the following mathematical expression for this objective:

Objective = ∑_{x ∈ D} ∑_{i ∈ A(x)} log Pr(xᵢ | x̄)

Here, D is the dataset, x is an original text sequence, x̄ is a version of x with some tokens masked, A(x) is the set of indices that were masked in x, and xᵢ is the original token at a masked position i. What does the inner summation, ∑_{i ∈ A(x)} log Pr(xᵢ | x̄), represent in this training process?
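For concreteness, here is a small Python sketch of the full double summation, assuming a hypothetical log_prob interface in place of a trained model; each dataset entry bundles the original tokens x, the corrupted tokens x̄, and the masked indices A(x).

```python
import math

# Hypothetical stand-in for the model: log Pr(x_i | x̄). Here it is a
# uniform distribution over a 4-word vocabulary, purely for illustration.
def log_prob(x_bar, i, token):
    return math.log(1 / 4)

# Each entry: (original tokens x, corrupted tokens x̄, masked indices A(x)).
dataset = [
    (["the", "fox", "jumps"], ["the", "[MASK]", "jumps"], [1]),
]

# Outer sum ranges over x ∈ D; the inner sum over i ∈ A(x) accumulates
# the log-probability of each original token at its masked position.
objective = sum(
    log_prob(x_bar, i, x[i])
    for x, x_bar, masked in dataset
    for i in masked
)
print(objective)
```

The inner summation is thus the per-sequence contribution: the total log-probability the model assigns to recovering the original tokens at that sequence's masked positions.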
Calculating Contribution to MLM Training Objective
A language model is being trained with the objective of maximizing the log-probability of the original tokens at masked positions. For the original sentence 'The fox jumps over the dog', the model is given the masked input 'The fox [MASK] over the dog'. Which of the following model predictions for the [MASK] token would contribute the most to achieving the training objective for this specific instance?
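The contribution of a single prediction to the objective is log Pr('jumps' | x̄), so whichever candidate distribution assigns the highest probability to the true token 'jumps' contributes the most. The sketch below illustrates this; the candidate labels and probabilities are made up for illustration and do not come from the original question.

```python
import math

# Hypothetical candidates: probability each prediction assigns to the
# true token 'jumps' at the masked position.
candidates = {"A": 0.10, "B": 0.60, "C": 0.25, "D": 0.05}

# Since log is monotone, the highest probability on 'jumps' yields the
# largest (least negative) log-probability, i.e. the biggest contribution.
best = max(candidates, key=lambda k: math.log(candidates[k]))
print(best)  # -> 'B' under these made-up numbers
```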