Illustrative Example of BERT's MLM Pre-training Pipeline
The data-corruption pipeline for BERT's Masked Language Model (MLM) pre-training can be illustrated with a step-by-step example. The process begins with an input sequence such as [CLS] It is raining . [SEP] I need an umbrella . [SEP]. First, 15% of the tokens are selected for prediction. The 80/10/10 rule is then applied to these selected tokens: 80% of them are replaced by the [MASK] symbol, giving an intermediate sequence like [CLS] It is [MASK] . [SEP] I need [MASK] umbrella . [SEP]; another 10% are replaced by a random token from the vocabulary, further altering the sequence to [CLS] It is [MASK] . [SEP] I need [MASK] hat . [SEP]; and the final 10% are left unchanged. The resulting corrupted sequence is then fed to the Transformer encoder, and the model is trained to predict the original token at every selected position.
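The same 15% selection and 80/10/10 corruption rule can be sketched in a few lines of Python. The snippet below is a minimal illustration under stated assumptions, not BERT's actual preprocessing code: the corrupt_for_mlm helper, the toy replacement vocabulary, and the whitespace tokenization are invented for readability, whereas a real implementation operates on WordPiece token IDs, samples over whole batches, and likewise excludes special tokens such as [CLS] and [SEP] from selection.

```python
import random

# Toy vocabulary used only for the "replace with a random token" branch (assumption).
VOCAB = ["hat", "dog", "car", "blue", "run"]
SPECIAL = {"[CLS]", "[SEP]"}


def corrupt_for_mlm(tokens, select_prob=0.15, seed=0):
    """Apply the 15% selection and 80/10/10 corruption rule to a token list.

    Returns the corrupted tokens and the selected positions, i.e. the
    positions whose original tokens the model is trained to predict.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    selected = []
    for i, tok in enumerate(tokens):
        # Special tokens are never corrupted; ordinary tokens are chosen with prob. 15%.
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        selected.append(i)
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"          # 80%: replace with the mask symbol
        elif r < 0.9:
            corrupted[i] = rng.choice(VOCAB)  # 10%: replace with a random vocabulary token
        # remaining 10%: leave the original token unchanged
    return corrupted, selected


tokens = "[CLS] It is raining . [SEP] I need an umbrella . [SEP]".split()
corrupted, positions = corrupt_for_mlm(tokens)
print(" ".join(corrupted))
print("positions to predict:", positions)
```

Running the sketch on the example sentence produces a corrupted sequence of the same length, together with the positions whose original tokens the encoder must recover during training.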
Related
During a specific language model pre-training procedure, 15% of tokens in an input sequence are chosen for prediction. Of these chosen tokens, 80% are replaced by a special [MASK] symbol, 10% are replaced by a random token from the vocabulary, and 10% remain unchanged. What is the primary analytical reason for including the steps where tokens are replaced by a random one or left unchanged, instead of simply replacing all 100% of the chosen tokens with the [MASK] symbol?
Calculating Token Modifications in Pre-training
A specific pre-training process for language models involves intentionally corrupting an input sequence and then training the model to reconstruct the original. Arrange the following steps of this data corruption and training objective in the correct chronological order.
Learn After
A language model's pre-training process involves corrupting input text. First, a subset of tokens (15%) is chosen for modification. Of these chosen tokens, 80% are replaced by a [MASK] token, 10% are replaced by a random token from the vocabulary, and 10% are left unchanged. The model is then trained to predict the original tokens for all chosen positions. Given the following transformation:
Original: [CLS] The artist painted a beautiful landscape . [SEP]
Corrupted: [CLS] The artist painted a beautiful [MASK] . [SEP]
If 'artist' and 'landscape' were the only two tokens chosen for modification, which statement provides the most accurate analysis of the corruption process?
Encoder Processing of a Corrupted Sequence in MLM
Evaluating a Pre-training Data Corruption Step
A text sequence is being prepared for a language model's training. The goal is to intentionally alter the sequence so the model can learn to predict the original words from the altered version. Arrange the following steps to correctly describe this data preparation pipeline.