Calculating Token Modifications in Pre-training
An input sequence for a language model contains 1,000 tokens. During a data corruption pre-training step, 15% of these tokens are randomly selected as prediction targets. These selected tokens are then modified according to a specific distribution: 80% are replaced with a special mask symbol, 10% are replaced with a random token, and 10% are left unchanged.
Based on this process, calculate the expected number of tokens in the sequence that will be:
a) Replaced with a mask symbol.
b) Replaced with a random token.
c) Left unchanged among the selected group.
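A minimal sketch of the expected-count arithmetic (the variable names are illustrative, not from the original question):

```python
# Expected token counts under the 15% selection and 80/10/10 modification rule.
sequence_length = 1000
selected = sequence_length * 0.15      # prediction targets: 150

masked = selected * 0.80               # a) replaced with the mask symbol: 120
randomized = selected * 0.10           # b) replaced with a random token: 15
unchanged = selected * 0.10            # c) left unchanged: 15

print(int(masked), int(randomized), int(unchanged))  # 120 15 15
```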
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Illustrative Example of BERT's MLM Pre-training Pipeline
During a specific language model pre-training procedure, 15% of tokens in an input sequence are chosen for prediction. Of these chosen tokens, 80% are replaced by a special [MASK] symbol, 10% are replaced by a random token from the vocabulary, and 10% remain unchanged. What is the primary analytical reason for including the steps where tokens are replaced by a random one or left unchanged, instead of simply replacing all 100% of the chosen tokens with the [MASK] symbol?
Calculating Token Modifications in Pre-training
A specific pre-training process for language models involves intentionally corrupting an input sequence and then training the model to reconstruct the original. Arrange the following steps of this data corruption and training objective in the correct chronological order.
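As a hedged illustration of the corruption process these related questions describe, here is a minimal sketch of BERT-style 80/10/10 masking; the function name, parameters, and per-position Bernoulli selection (rather than choosing exactly 15% of positions) are simplifying assumptions, not details from the source:

```python
import random

def corrupt_for_mlm(token_ids, vocab_size, mask_id, select_prob=0.15):
    """Corrupt a token sequence for masked language modeling.

    Each position is independently chosen as a prediction target with
    probability `select_prob`; each chosen token is then masked (80%),
    replaced with a random vocabulary token (10%), or left unchanged (10%).
    Returns the corrupted sequence and the target positions to reconstruct.
    """
    corrupted = list(token_ids)
    targets = []
    for i in range(len(corrupted)):
        if random.random() < select_prob:         # step 1: select the position
            targets.append(i)
            r = random.random()
            if r < 0.80:                          # step 2a: 80% -> mask symbol
                corrupted[i] = mask_id
            elif r < 0.90:                        # step 2b: 10% -> random token
                corrupted[i] = random.randrange(vocab_size)
            # step 2c: remaining 10% -> token left unchanged
    return corrupted, targets                     # step 3: train to predict targets
```

The random-replacement and leave-unchanged branches mirror BERT's design: the [MASK] symbol never appears in downstream inputs, so corrupting some targets with non-mask tokens reduces the pre-training/fine-tuning mismatch and pushes the model to build contextual representations for every input token.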