Random Token Replacement in BERT's MLM Strategy
As part of BERT's token-modification strategy for Masked Language Modeling, 10% of the tokens chosen for prediction are replaced with a random token drawn from the vocabulary. This deliberately injects noise into the input: because the model cannot tell whether a visible token is genuine or a substitution, it must rely on the surrounding context to recover the original token, which improves the robustness of its learned representations.
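The standard 80/10/10 split can be illustrated with a short sketch. The following is a minimal, hypothetical Python implementation (the function name and the -100 ignore-label convention are assumptions for illustration, not taken from the course material); it also assumes special tokens such as [CLS] and [SEP] have already been excluded from selection.

```python
import random

def corrupt_for_mlm(token_ids, vocab_size, mask_id, select_prob=0.15, seed=None):
    """Apply a BERT-style MLM corruption to a list of token ids.

    Among the tokens selected for prediction (~15% of the sequence),
    80% are replaced with [MASK], 10% with a random vocabulary token,
    and 10% are left unchanged. Returns the corrupted ids and labels,
    where -100 marks positions the loss should ignore (a common convention).
    """
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)

    for i, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue                                  # not selected for prediction
        labels[i] = tok                               # model must recover the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_id                    # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # else: the remaining 10% are left unchanged

    return corrupted, labels

# Hypothetical usage with made-up token ids:
# corrupted, labels = corrupt_for_mlm([12, 7, 99, 4, 55], vocab_size=30522, mask_id=103, seed=0)
```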
References
Reference of Foundations of Large Language Models Course
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Token Masking in BERT's MLM Strategy
Random Token Replacement in BERT's MLM Strategy
Unchanged Tokens in BERT's MLM Strategy
When pre-training a language model, a common technique is to select a subset of tokens in an input sequence and train the model to predict them. A simple approach would be to replace every selected token with a special [MASK] symbol. However, a more sophisticated strategy is often used where, for the selected tokens, some are replaced with [MASK], some are replaced with a random token, and some are left unchanged. What is the primary analytical reason for adopting this more complex, multi-faceted strategy over simply masking 100% of the selected tokens?
Critiquing a Pre-training Implementation
In a common self-supervised pre-training approach, a fraction of tokens in an input sequence is selected for the model to predict. Each of these selected tokens is then modified in one of three ways before being fed to the model. Match each modification method with its corresponding description.
Learn After
Example of Random Token Replacement in a BERT Input Sequence
In a language model's pre-training, a portion of input tokens selected for prediction are substituted with a completely random token from the vocabulary, rather than always using a special placeholder like [MASK]. What is the primary analytical justification for this specific strategy?
Predicting from Corrupted Input
A language model's pre-training process involves selecting a subset of tokens in an input sequence for prediction. One modification technique applied to these selected tokens is to substitute them with a completely random token from the model's vocabulary. Given the original sequence:
The cat sat on the mat.
If the token sat is chosen for this specific random replacement technique, which of the following is a valid resulting sequence?