Evaluating a Pre-training Data Corruption Step
A language model's pre-training pipeline selects 15% of the tokens in a sequence and then applies an 80/10/10 rule to those selected tokens: 80% are replaced with a special [MASK] token, 10% are replaced with a different random token, and 10% are left unchanged. Given the following case, is the output a valid transformation according to this procedure? Explain your reasoning.
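The rule can be made concrete with a minimal sketch. This is an illustrative toy implementation, not a real tokenizer pipeline: the helper name `corrupt_tokens`, the small stand-in vocabulary, and the exclusion of [CLS]/[SEP] from selection are all assumptions for the example.

```python
import random

def corrupt_tokens(tokens, mask_token="[MASK]", vocab=None, seed=0):
    """BERT-style corruption: pick 15% of tokens, then apply 80/10/10.

    `vocab` is a toy stand-in for the model vocabulary, used only by the
    random-replacement branch. Returns the corrupted sequence and a dict
    mapping each chosen position to its original token (the prediction
    targets).
    """
    rng = random.Random(seed)
    vocab = vocab or ["the", "cat", "dog", "sat", "ran"]
    corrupted = list(tokens)
    # Special tokens like [CLS]/[SEP] are assumed never to be selected.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    n_select = max(1, round(0.15 * len(candidates)))
    selected = rng.sample(candidates, n_select)
    labels = {}
    for i in selected:
        labels[i] = tokens[i]
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_token          # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: random vocabulary token
        # else: 10% left unchanged (but still predicted during training)
    return corrupted, labels
```

Note that even a token left unchanged by the 10% branch is still a prediction target, which is why `labels` is keyed on the selected positions rather than on which tokens visibly changed.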
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model's pre-training process involves corrupting input text. First, a subset of tokens (15%) is chosen for modification. Of these chosen tokens, 80% are replaced by a [MASK] token, 10% are replaced by a random token from the vocabulary, and 10% are left unchanged. The model is then trained to predict the original tokens for all chosen positions.

Given the following transformation:

Original: [CLS] The artist painted a beautiful landscape . [SEP]
Corrupted: [CLS] The artist painted a beautiful [MASK] . [SEP]

If 'artist' and 'landscape' were the only two tokens chosen for modification, which statement provides the most accurate analysis of the corruption process?
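A position-by-position comparison makes the analysis mechanical. The sketch below simply diffs the two sequences; the variable names and the whitespace tokenization are assumptions for illustration.

```python
original  = "[CLS] The artist painted a beautiful landscape . [SEP]".split()
corrupted = "[CLS] The artist painted a beautiful [MASK] . [SEP]".split()

# Positions where the corrupted sequence differs from the original.
changes = {i: (o, c)
           for i, (o, c) in enumerate(zip(original, corrupted))
           if o != c}

# Only 'landscape' visibly changed (the 80% [MASK] branch). 'artist' is
# unchanged, which is consistent with the 10% keep-as-is branch: it was
# still chosen, so the model must still predict it at that position.
print(changes)  # {6: ('landscape', '[MASK]')}
```

The key point the diff surfaces is that an unchanged chosen token leaves no visible trace in the corrupted text, so the transformation shown is fully consistent with both 'artist' and 'landscape' having been selected.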
Encoder Processing of a Corrupted Sequence in MLM
Evaluating a Pre-training Data Corruption Step
A text sequence is being prepared for a language model's training. The goal is to intentionally alter the sequence so the model can learn to predict the original words from the altered version. Arrange the following steps to correctly describe this data preparation pipeline.