Token Selection and Modification Strategy in BERT's MLM
In the standard implementation of Masked Language Modeling (MLM) for BERT, 15% of the tokens in each input sequence are randomly selected for prediction. Each selected token is then modified in one of three ways: with 80% probability it is replaced by the special [MASK] token, with 10% probability it is replaced by a random token from the vocabulary, and with 10% probability it is left unchanged.
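As an illustration, here is a minimal sketch of this selection-and-modification step. The function and variable names are hypothetical, and this is not the reference BERT implementation; it only demonstrates the 15% selection followed by the 80/10/10 split described above.

```python
import random

MASK_TOKEN = "[MASK]"

def apply_bert_masking(tokens, vocab, select_prob=0.15,
                       mask_prob=0.8, random_prob=0.1):
    """Sketch of BERT-style MLM corruption (hypothetical helper).

    Of the tokens chosen for prediction (select_prob of the sequence),
    80% become [MASK], 10% become a random vocabulary token, and 10%
    are left unchanged. Returns the corrupted sequence and the list of
    (position, original_token) prediction targets.
    """
    corrupted = list(tokens)
    targets = []
    for i, token in enumerate(tokens):
        if random.random() >= select_prob:
            continue  # token not selected for prediction
        targets.append((i, token))
        r = random.random()
        if r < mask_prob:                  # 80%: replace with [MASK]
            corrupted[i] = MASK_TOKEN
        elif r < mask_prob + random_prob:  # 10%: replace with a random token
            corrupted[i] = random.choice(vocab)
        # else: 10%: leave the original token unchanged
    return corrupted, targets

# Example usage
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
sentence = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = apply_bert_masking(sentence, vocab)
print(corrupted)  # e.g. ['the', 'quick', '[MASK]', 'fox', ...]
print(targets)    # e.g. [(2, 'brown')]
```

Note that the model is trained to predict the original token at every selected position, including the positions that were left unchanged or replaced with a random token, not only those showing [MASK].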
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Comparison of Masked vs. Causal Language Modeling
Formal Definition of the Masking Process in MLM
Example of Masked Language Modeling with Single and Multiple Masks
Training Objective of Masked Language Modeling (MLM)
Drawback of Masked Language Modeling: The [MASK] Token Discrepancy
Limitation of MLM: Ignoring Dependencies Between Masked Tokens
The Generator in Replaced Token Detection
Consecutive Token Masking in MLM
BERT's Masked Language Modeling Pre-training Pipeline
Performance Degradation and Early Stopping in Pre-training
Flexibility of Masked Language Modeling for Encoder-Decoder Training
Training Objective of the Standard BERT Model
During a self-supervised pre-training process, a model is given an input sequence where one word has been replaced by a special symbol, for example: 'The quick brown [MASK] jumps over the lazy dog.' The model's objective is to predict the original word, 'fox'. Which of the following is the direct input used by the final output layer to make this specific prediction?
Original Sequence for Masking and Deletion Examples
Arrange the following steps in the correct order to describe the process of pre-training an encoder model using a masked language modeling objective.
Evaluating a Pre-training Strategy for a Specific Application
Learn After
Token Masking in BERT's MLM Strategy
Random Token Replacement in BERT's MLM Strategy
Unchanged Tokens in BERT's MLM Strategy
When pre-training a language model, a common technique is to select a subset of tokens in an input sequence and train the model to predict them. A simple approach would be to replace every selected token with a special [MASK] symbol. However, a more sophisticated strategy is often used where, for the selected tokens, some are replaced with [MASK], some are replaced with a random token, and some are left unchanged. What is the primary analytical reason for adopting this more complex, multi-faceted strategy over simply masking 100% of the selected tokens?
Critiquing a Pre-training Implementation
In a common self-supervised pre-training approach, a fraction of tokens in an input sequence is selected for the model to predict. Each of these selected tokens is then modified in one of three ways before being fed to the model. Match each modification method with its corresponding description.