A language model is being trained with the objective of maximizing the log-probability of the original tokens at masked positions. For the original sentence 'The fox jumps over the dog', the model is given the masked input 'The fox [MASK] over the dog'. Which of the following model predictions for the [MASK] token would contribute the most to achieving the training objective for this specific instance?
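The answer choices are not reproduced here, but the principle can be illustrated with a minimal Python sketch using hypothetical predicted distributions (the probabilities below are made up, not the original options): the prediction that assigns the highest probability to the original token 'jumps' contributes the largest log-probability, and hence the most toward the training objective.

```python
import math

# Hypothetical predicted distributions over the [MASK] position.
# These are illustrative stand-ins for the question's answer choices.
candidates = {
    "A": {"jumps": 0.7, "runs": 0.2, "leaps": 0.1},
    "B": {"jumps": 0.3, "runs": 0.5, "leaps": 0.2},
    "C": {"jumps": 0.1, "runs": 0.1, "leaps": 0.8},
}

original_token = "jumps"

# The per-instance contribution to the MLM objective is
# log Pr(original token | masked input); higher is better.
for name, dist in candidates.items():
    print(name, math.log(dist[original_token]))
```

Candidate A, which places the most probability mass on the original token, yields the least negative log-probability and thus the largest contribution.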
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Application in Bloom's Taxonomy
Related
MLM Training Objective using Cross-Entropy Loss
In the context of training a language model, the objective is often to find parameters that maximize the likelihood of the training data. Consider the following mathematical expression for this objective:
Objective = ∑_{x ∈ D} ∑_{i ∈ A(x)} log Pr(xᵢ | x̄)

Here, D is the dataset, x is an original text sequence, x̄ is a version of x with some tokens masked, A(x) is the set of indices that were masked in x, and xᵢ is the original token at a masked position i. What does the inner summation, ∑_{i ∈ A(x)} log Pr(xᵢ | x̄), represent in this training process?
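As a concrete illustration of the formula above, here is a minimal Python sketch over a toy dataset (the sequences and probabilities are made up; in practice Pr(xᵢ | x̄) would come from the model). The inner loop computes the inner summation, the per-sequence total log-probability of the original tokens at masked positions, and the outer loop sums these contributions over the dataset.

```python
import math

# Toy dataset: each entry records the original tokens, the masked
# indices A(x), and a hypothetical model probability Pr(x_i | x_bar)
# for the original token at each masked position.
dataset = [
    {"tokens": ["The", "fox", "jumps", "over", "the", "dog"],
     "masked": {2: 0.6}},          # A(x) = {2}
    {"tokens": ["A", "cat", "sleeps", "on", "the", "mat"],
     "masked": {1: 0.4, 4: 0.9}},  # A(x) = {1, 4}
]

objective = 0.0
for x in dataset:
    # Inner summation: log-probabilities of the original tokens
    # at this sequence's masked positions.
    inner = sum(math.log(p) for p in x["masked"].values())
    objective += inner  # outer summation over the dataset

print(objective)
```

The inner summation is therefore this sequence's total contribution to the objective: the joint log-probability the model assigns to recovering all of its masked tokens.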
Calculating Contribution to MLM Training Objective
Example of Masked Language Modeling Loss Calculation