During a specific language model pre-training procedure, 15% of tokens in an input sequence are chosen for prediction. Of these chosen tokens, 80% are replaced by a special [MASK] symbol, 10% are replaced by a random token from the vocabulary, and 10% remain unchanged. What is the primary analytical reason for including the steps where tokens are replaced by a random one or left unchanged, instead of simply replacing all 100% of the chosen tokens with the [MASK] symbol?
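The selection-and-corruption procedure described in the question can be written out directly. Below is a minimal sketch of that 15% / 80-10-10 scheme; MASK_ID and VOCAB_SIZE are assumed placeholder values for illustration, not ids from any particular tokenizer.

```python
import random

# Assumed toy values, not taken from a real tokenizer.
MASK_ID = 103
VOCAB_SIZE = 30522

def corrupt_for_mlm(token_ids, select_prob=0.15):
    """Return (corrupted_ids, labels); labels hold the original id at the
    positions selected for prediction and -100 (ignored) everywhere else."""
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < select_prob:      # select ~15% of tokens for prediction
            labels.append(tok)                 # the model must recover the original token
            roll = random.random()
            if roll < 0.8:                     # 80% of selected: replace with [MASK]
                corrupted.append(MASK_ID)
            elif roll < 0.9:                   # 10% of selected: replace with a random token
                corrupted.append(random.randrange(VOCAB_SIZE))
            else:                              # 10% of selected: leave unchanged
                corrupted.append(tok)
        else:
            corrupted.append(tok)              # unselected tokens pass through untouched
            labels.append(-100)
    return corrupted, labels
```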
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Illustrative Example of BERT's MLM Pre-training Pipeline
Calculating Token Modifications in Pre-training
A specific pre-training process for language models involves intentionally corrupting an input sequence and then training the model to reconstruct the original. Arrange the following steps of this data corruption and training objective in the correct chronological order.
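For the ordering asked about in this related item, a single masked-language-modeling training step can be sketched roughly as follows. This is an assumed minimal illustration: the tiny embedding-plus-linear "encoder" is a stand-in for a real Transformer encoder, and the 15% / 80-10-10 numbers follow the description above.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, MASK_ID = 1000, 32, 3                     # assumed toy sizes and mask id
encoder = nn.Sequential(nn.Embedding(VOCAB, HIDDEN), nn.Linear(HIDDEN, VOCAB))
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)         # unselected positions are ignored

original = torch.randint(4, VOCAB, (2, 16))              # 1. start from the intact sequence

selected = torch.rand(original.shape) < 0.15             # 2. choose ~15% of positions
labels = torch.where(selected, original, torch.full_like(original, -100))
roll = torch.rand(original.shape)
corrupted = original.clone()
corrupted[selected & (roll < 0.8)] = MASK_ID             #    80%: replace with [MASK]
rand_tok = torch.randint(4, VOCAB, original.shape)
swap = selected & (roll >= 0.8) & (roll < 0.9)           #    10%: replace with a random token
corrupted[swap] = rand_tok[swap]                         #    remaining 10%: left unchanged

logits = encoder(corrupted)                              # 3. run the model on the corrupted input
loss = loss_fn(logits.reshape(-1, VOCAB), labels.reshape(-1))
                                                         # 4. loss only at the selected positions
optimizer.zero_grad()
loss.backward()                                          # 5. backpropagate and update
optimizer.step()
```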