Critiquing a Pre-training Implementation
A data scientist is preparing a text sequence of 200 tokens for a self-supervised pre-training task. Their script correctly selects 30 tokens (15%) for the model to predict. However, the script then modifies the sequence by replacing all 30 of these selected tokens with a special [MASK] symbol. Based on the standard token modification strategy, what is the primary issue with this implementation?
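For context, the standard strategy (popularized by BERT's masked language modeling objective) does not mask every selected token: roughly 80% become [MASK], about 10% are swapped for a random vocabulary token, and the remaining 10% are left unchanged. A minimal sketch of that modification step, assuming integer token IDs and a mask_token_id / vocab_size supplied by the tokenizer (both hypothetical parameters here, not from the card), might look like:

```python
import random

def modify_selected_tokens(token_ids, selected_positions, mask_token_id, vocab_size):
    """Apply the standard 80/10/10 modification to the selected positions.

    Sketch only: assumes token_ids is a list of integer IDs; real
    implementations typically also avoid drawing special-token IDs
    in the random-replacement branch.
    """
    modified = list(token_ids)
    for pos in selected_positions:
        r = random.random()
        if r < 0.8:
            # ~80%: replace with the special [MASK] symbol
            modified[pos] = mask_token_id
        elif r < 0.9:
            # ~10%: replace with a random token from the vocabulary
            modified[pos] = random.randrange(vocab_size)
        # remaining ~10%: leave the original token unchanged
    return modified
```

Note that the model is still trained to predict the original token at every selected position, regardless of which branch was taken.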
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Token Masking in BERT's MLM Strategy
Random Token Replacement in BERT's MLM Strategy
Unchanged Tokens in BERT's MLM Strategy
When pre-training a language model, a common technique is to select a subset of tokens in an input sequence and train the model to predict them. A simple approach would be to replace every selected token with a special [MASK] symbol. However, a more sophisticated strategy is often used where, for the selected tokens, some are replaced with [MASK], some are replaced with a random token, and some are left unchanged. What is the primary analytical reason for adopting this more complex, multi-faceted strategy over simply masking 100% of the selected tokens?
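A quick, illustrative run of the modify_selected_tokens sketch above (seeded for reproducibility; the mask ID and vocabulary size are placeholder values) shows that the selected positions end up in a mix of states rather than uniformly masked:

```python
random.seed(0)                                    # reproducible illustration
ids = list(range(200))                            # toy 200-token sequence
selected = random.sample(range(200), 30)          # 15% of positions selected
out = modify_selected_tokens(ids, selected, mask_token_id=10_000, vocab_size=30_000)
altered = sum(out[p] != ids[p] for p in selected)
print(f"{altered}/30 selected positions were actually altered")
```

Because some selected tokens are replaced with random tokens or left unchanged, the model cannot rely on the [MASK] symbol, which never appears during fine-tuning, as its only cue for which positions to reconstruct.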
In a common self-supervised pre-training approach, a fraction of tokens in an input sequence is selected for the model to predict. Each of these selected tokens is then modified in one of three ways before being fed to the model. Match each modification method with its corresponding description.
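For a matching exercise like this, the pairings generally cited (following BERT's scheme; the labels below are illustrative and echo the related cards above rather than anything in this card) can be summarized in a small lookup table:

```python
# Illustrative labels and descriptions; proportions follow the
# commonly cited BERT recipe rather than anything in the card itself.
MODIFICATION_METHODS = {
    "token masking":      "selected token is replaced with the special [MASK] symbol (~80%)",
    "random replacement": "selected token is replaced with a random vocabulary token (~10%)",
    "keep unchanged":     "selected token is left as-is in the input (~10%)",
}
```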