Learn Before
Running Example of Computing MLM Loss
A running example of BERT-style masked language modeling illustrates how the Masked Language Modeling (MLM) loss is computed. The process begins by selecting a portion of the input tokens, typically 15%, to be masked or otherwise modified; the loss is then calculated from the model's predicted probabilities at those masked positions.
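As a concrete sketch, the per-position loss is the negative natural logarithm of the probability the model assigns to the original token, averaged over the masked positions only. The distributions and target tokens below are hypothetical toy values for illustration, not outputs of a real model:

```python
import math

# Hypothetical predicted distributions at two masked positions
# (toy values for illustration, not real model outputs).
predictions = [
    {"fox": 0.7, "cat": 0.2, "dog": 0.1},
    {"river": 0.3, "stream": 0.4, "water": 0.2},
]
targets = ["fox", "river"]  # original tokens at the masked positions

# MLM loss: average negative log-probability of the correct token,
# computed over the masked positions only.
loss = -sum(math.log(p[t]) for p, t in zip(predictions, targets)) / len(targets)
print(f"MLM loss = {loss:.4f}")  # (-ln 0.7 - ln 0.3) / 2 ≈ 0.7803
```

Unmasked positions contribute nothing to the sum; only the selected positions are scored.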
Tags
Foundations of Large Language Models
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A language model is given an input sequence where one token has been replaced by a [MASK] token. The original, correct token for that position was 'fox'. After processing the input, the model outputs the following probability distribution for the masked position:
- P('fox') = 0.7
- P('cat') = 0.2
- P('dog') = 0.1
If the training objective for this single token is to minimize the negative natural logarithm of the probability of the correct token, what is the calculated loss value for this instance? (Use ln for natural logarithm)
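A minimal Python check of the arithmetic this question asks for, assuming the single-token loss is simply the negative natural log of the probability of the correct token:

```python
import math

# Single-token MLM loss: negative natural log of the probability
# the model assigned to the correct token ('fox').
loss = -math.log(0.7)
print(f"loss = {loss:.4f}")  # -ln(0.7) ≈ 0.3567
```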
Two language models, Model A and Model B, are tasked with predicting a masked token in a sentence. The correct, original token is 'river'.
Model A's predicted probabilities for the masked position include:
- P('river') = 0.3
- P('stream') = 0.4
- P('water') = 0.2
Model B's predicted probabilities for the masked position include:
- P('river') = 0.01
- P('mountain') = 0.95
- P('sky') = 0.02
Based on the standard negative log-likelihood loss used for this task, how do the calculated losses for Model A and Model B compare on this single prediction?
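The comparison can be verified directly; a short Python sketch evaluating both models' losses:

```python
import math

# Negative log-likelihood of the correct token 'river' under each model.
loss_a = -math.log(0.3)   # Model A assigns P('river') = 0.3
loss_b = -math.log(0.01)  # Model B assigns P('river') = 0.01
print(f"Model A loss ≈ {loss_a:.4f}")  # ≈ 1.2040
print(f"Model B loss ≈ {loss_b:.4f}")  # ≈ 4.6052
```

Note that Model A incurs the lower loss even though 'river' is not its top-ranked token: the loss depends only on the probability assigned to the correct token, and Model B's confident wrong prediction is penalized far more heavily.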
Calculating Total MLM Loss for a Sequence
Running Example of Computing MLM Loss