Token Masking in BERT's MLM Strategy
As part of BERT's token modification strategy for Masked Language Modeling (MLM), 80% of the tokens chosen for prediction undergo token masking: the original token is replaced with the special [MASK] symbol, and the model is trained to recover it from the surrounding context.
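To make the surrounding 80/10/10 split concrete, here is a minimal Python sketch of the selection-and-corruption step. It is illustrative only, not BERT's actual implementation: mask_tokens is a hypothetical helper that operates on plain string lists rather than a real tokenizer's IDs, and the vocab argument stands in for whatever vocabulary a real implementation would sample random replacement tokens from.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Corrupt a token sequence with BERT-style MLM noise.

    Each token is independently selected for prediction with probability
    select_prob (15% in the standard BERT setup). For a selected token:
      - 80% of the time it is replaced with [MASK] (token masking),
      - 10% of the time it is replaced with a random vocabulary token,
      - 10% of the time it is left unchanged.
    Returns the corrupted sequence and the indices the model must predict.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    target_indices = []
    for i in range(len(tokens)):
        if rng.random() < select_prob:
            target_indices.append(i)
            r = rng.random()
            if r < 0.8:
                # Token masking: the case this card covers.
                corrupted[i] = MASK_TOKEN
            elif r < 0.9:
                # Random token replacement.
                corrupted[i] = rng.choice(vocab)
            # else: token is left unchanged but remains a prediction target.
    return corrupted, target_indices
```

Because selection is probabilistic, the number of [MASK] replacements varies from sequence to sequence; this is exactly the point probed by the "Verifying a Language Model's Pre-training Data" question below.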
Related
Random Token Replacement in BERT's MLM Strategy
Unchanged Tokens in BERT's MLM Strategy
When pre-training a language model, a common technique is to select a subset of tokens in an input sequence and train the model to predict them. A simple approach would be to replace every selected token with a special [MASK] symbol. However, a more sophisticated strategy is often used where, for the selected tokens, some are replaced with [MASK], some are replaced with a random token, and some are left unchanged. What is the primary analytical reason for adopting this more complex, multi-faceted strategy over simply masking 100% of the selected tokens?
Critiquing a Pre-training Implementation
In a common self-supervised pre-training approach, a fraction of tokens in an input sequence is selected for the model to predict. Each of these selected tokens is then modified in one of three ways before being fed to the model. Match each modification method with its corresponding description.
Learn After
Example of Token Masking in a BERT Input Sequence
During a language model's pre-training, a specific strategy is used to alter words that have been chosen for the model to predict. If 10,000 words in a dataset have been chosen for this prediction task, and the strategy dictates that 80% of these chosen words are replaced with a special placeholder symbol, approximately how many of the 10,000 chosen words will be replaced by this symbol?
Verifying a Language Model's Pre-training Data
Consider a standard pre-training procedure for a language model where 15% of all tokens in an input are first selected for prediction. Of these selected tokens, 80% are then replaced with a special [MASK] symbol. Based on this procedure, it is guaranteed that for any given input sequence of 1,000 tokens, exactly 120 tokens will be replaced with the [MASK] symbol.