Unchanged Tokens in BERT's MLM Strategy
In BERT's Masked Language Modeling (MLM) strategy, 10% of the tokens chosen for prediction are kept in their original, unchanged form within the input sequence; the model is still trained to predict these tokens from their surrounding context.
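For context, the full split in the original BERT recipe selects 15% of the input tokens as prediction targets and then, of those, replaces 80% with [MASK], replaces 10% with a random token, and leaves 10% unchanged. The sketch below is a minimal Python illustration of that corruption step, not BERT's actual implementation; the function name corrupt_tokens, its arguments, and the toy vocabulary are invented for this example.

```python
import random

def corrupt_tokens(tokens, vocab, select_prob=0.15, mask_token="[MASK]"):
    """BERT-style MLM corruption: each token is selected for prediction
    with probability select_prob; of the selected tokens, 80% become
    mask_token, 10% become a random vocabulary token, and 10% are left
    unchanged."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)    # None = not a prediction target
    for i, tok in enumerate(tokens):
        if random.random() >= select_prob:
            continue                 # token not selected for prediction
        labels[i] = tok              # model must recover the original token
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_token             # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the token unchanged in the input
    return corrupted, labels

tokens = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(corrupt_tokens(tokens, vocab))
```

The unchanged 10% matters because the model cannot assume that an unmasked input token is always the true token, so it must maintain informative representations for every position rather than only for [MASK] positions.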
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Token Masking in BERT's MLM Strategy
Random Token Replacement in BERT's MLM Strategy
Unchanged Tokens in BERT's MLM Strategy
When pre-training a language model, a common technique is to select a subset of tokens in an input sequence and train the model to predict them. A simple approach would be to replace every selected token with a special
[MASK] symbol. However, a more sophisticated strategy is often used where, for the selected tokens, some are replaced with [MASK], some are replaced with a random token, and some are left unchanged. What is the primary analytical reason for adopting this more complex, multi-faceted strategy over simply masking 100% of the selected tokens?
Critiquing a Pre-training Implementation
In a common self-supervised pre-training approach, a fraction of tokens in an input sequence is selected for the model to predict. Each of these selected tokens is then modified in one of three ways before being fed to the model. Match each modification method with its corresponding description.
Learn After
Example of an Unchanged Token in a BERT Input Sequence
A language model is pre-trained using a method where 15% of the words in an input sentence are selected for prediction. Of these selected words, a small fraction (10%) are intentionally left in their original form, while the model is still tasked with predicting them based on the surrounding context. What is the most significant reason for this strategy of leaving some target words unchanged? (A worked count of these fractions appears after this list.)
Calculating Token Modifications in Pre-training
Critique of a Modified Pre-training Strategy
Purpose of Unchanged Tokens in BERT's MLM Strategy
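To make the fractions in the question above concrete, here is a worked count under an assumed sequence length of 512 tokens (a round number chosen for illustration, not taken from the question itself).

```python
seq_len = 512
selected = round(0.15 * seq_len)      # prediction targets: 77
masked = round(0.80 * selected)       # replaced with [MASK]: 62
randomized = round(0.10 * selected)   # replaced with a random token: 8
unchanged = round(0.10 * selected)    # left in original form: 8

# Independent rounding makes the parts sum to 78 rather than 77;
# real implementations assign each selected token to exactly one bucket.
print(selected, masked, randomized, unchanged)
```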