During a specific language model pre-training procedure, 15% of tokens in an input sequence are chosen for prediction. Of these chosen tokens, 80% are replaced by a special [MASK] symbol, 10% are replaced by a random token from the vocabulary, and 10% remain unchanged. What is the primary analytical reason for including the steps where tokens are replaced by a random one or left unchanged, instead of simply replacing all 100% of the chosen tokens with the [MASK] symbol?
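The selection-and-corruption procedure described in the question can be written out directly. Below is a minimal sketch of that 15% / 80-10-10 scheme; MASK_ID and VOCAB_SIZE are assumed placeholder values for illustration, not ids from any particular tokenizer.

```python
import random

# Assumed toy values, not taken from a real tokenizer.
MASK_ID = 103
VOCAB_SIZE = 30522

def corrupt_for_mlm(token_ids, select_prob=0.15):
    """Return (corrupted_ids, labels); labels hold the original id at the
    positions selected for prediction and -100 (ignored) everywhere else."""
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < select_prob:      # select ~15% of tokens for prediction
            labels.append(tok)                 # the model must recover the original token
            roll = random.random()
            if roll < 0.8:                     # 80% of selected: replace with [MASK]
                corrupted.append(MASK_ID)
            elif roll < 0.9:                   # 10% of selected: replace with a random token
                corrupted.append(random.randrange(VOCAB_SIZE))
            else:                              # 10% of selected: leave unchanged
                corrupted.append(tok)
        else:
            corrupted.append(tok)              # unselected tokens pass through untouched
            labels.append(-100)
    return corrupted, labels
```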
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Illustrative Example of BERT's MLM Pre-training Pipeline
Calculating Token Modifications in Pre-training
A specific pre-training process for language models involves intentionally corrupting an input sequence and then training the model to reconstruct the original. Arrange the following steps of this data corruption and training objective in the correct chronological order.
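For the ordering asked about in this related item, a single masked-language-modeling training step can be sketched roughly as follows. This is an assumed minimal illustration: the tiny embedding-plus-linear "encoder" is a stand-in for a real Transformer encoder, and the 15% / 80-10-10 numbers follow the description above.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, MASK_ID = 1000, 32, 3                     # assumed toy sizes and mask id
encoder = nn.Sequential(nn.Embedding(VOCAB, HIDDEN), nn.Linear(HIDDEN, VOCAB))
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)         # unselected positions are ignored

original = torch.randint(4, VOCAB, (2, 16))              # 1. start from the intact sequence

selected = torch.rand(original.shape) < 0.15             # 2. choose ~15% of positions
labels = torch.where(selected, original, torch.full_like(original, -100))
roll = torch.rand(original.shape)
corrupted = original.clone()
corrupted[selected & (roll < 0.8)] = MASK_ID             #    80%: replace with [MASK]
rand_tok = torch.randint(4, VOCAB, original.shape)
swap = selected & (roll >= 0.8) & (roll < 0.9)           #    10%: replace with a random token
corrupted[swap] = rand_tok[swap]                         #    remaining 10%: left unchanged

logits = encoder(corrupted)                              # 3. run the model on the corrupted input
loss = loss_fn(logits.reshape(-1, VOCAB), labels.reshape(-1))
                                                         # 4. loss only at the selected positions
optimizer.zero_grad()
loss.backward()                                          # 5. backpropagate and update
optimizer.step()
```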