Training the Decoder as a Language Model in 100% Masking Scenarios
In the specific case of Masked Language Modeling where 100% of the input tokens are masked, the training objective becomes equivalent to sequence generation: the input no longer provides any unmasked context, so the model must reconstruct the whole sentence from scratch. Consequently, the decoder is trained to operate as a language model, responsible for generating the entire original text.
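The contrast is easiest to see in how a single training example is constructed. The sketch below is an illustrative assumption rather than code from the course: the helper make_masked_example and the toy sentence are hypothetical. It builds an (encoder input, decoder target) pair for a masked-denoising objective at a 15% mask rate and at a 100% mask rate; at 100%, the encoder input carries no lexical content, so the decoder's target is simply the full original sequence.

```python
import random

def make_masked_example(tokens, mask_rate, mask_token="[MASK]"):
    # Replace a fraction of the tokens with [MASK] in the encoder input;
    # the decoder's target is always the full original sentence.
    n_to_mask = max(1, round(mask_rate * len(tokens)))
    masked = set(random.sample(range(len(tokens)), n_to_mask))
    encoder_input = [mask_token if i in masked else t for i, t in enumerate(tokens)]
    decoder_target = list(tokens)
    return encoder_input, decoder_target

tokens = "the cat sat on the mat".split()

# 15% masking: the encoder still sees most of the sentence, so the decoder
# mostly learns to fill in a few gaps from the surrounding context.
print(make_masked_example(tokens, mask_rate=0.15))

# 100% masking: the encoder input contains no lexical information at all, so
# the decoder must generate the entire original sentence from scratch,
# which is exactly the job of an autoregressive language model.
print(make_masked_example(tokens, mask_rate=1.0))
```

At 15%, the decoder's predictions are heavily conditioned on the visible tokens; at 100%, it can condition only on its own previously generated tokens, which is the standard autoregressive language-modeling setup.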
Tags: Ch.1 Pre-training - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences
Related
Training the Decoder as a Language Model in 100% Masking Scenarios
A language model is trained using an objective where every token in the input sentence is replaced by a [MASK] token. The model is then required to reconstruct the entire original sentence. How does the primary skill developed by this training method differ from a method where only a small fraction (e.g., 15%) of the tokens are masked?
Constructing a 100% Masked Training Example
Evaluating a Model Training Strategy
Learn After
Consider a text-infilling model that is typically trained by masking about 15% of the words in a sentence and having the model predict them based on the surrounding unmasked words. If this training process is modified to mask 100% of the words in every input sentence, what is the most significant change in the fundamental skill the model is being trained to perform?
Model Suitability for a Generation Task
Shift in Training Objective with 100% Masking