When pre-training a language model, a common technique is to select a subset of tokens in an input sequence and train the model to predict them. A simple approach would be to replace every selected token with a special [MASK] symbol. However, a more sophisticated strategy is often used where, for the selected tokens, some are replaced with [MASK], some are replaced with a random token, and some are left unchanged. What is the primary analytical reason for adopting this more complex, multi-faceted strategy over simply masking 100% of the selected tokens?
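For concreteness, the sketch below shows how such a corruption step is often implemented. The 15% selection rate and the 80%/10%/10% split follow BERT's published recipe; the function name, variable names, and token-list representation are illustrative assumptions, not a fixed specification:

import random

def corrupt_tokens(tokens, vocab, mask_token="[MASK]", select_prob=0.15):
    """BERT-style corruption: select ~15% of positions for prediction,
    then among the selected positions use [MASK] 80% of the time, a
    random vocabulary token 10% of the time, and the original token
    10% of the time. Returns the corrupted sequence and the targets."""
    corrupted = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if random.random() >= select_prob:
            continue  # token not selected for prediction
        targets[i] = tok
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_token            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)  # 10%: random token
        # else: 10%: leave the token unchanged
    return corrupted, targets

# Example with a toy sequence and toy vocabulary:
tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
corrupted, targets = corrupt_tokens(tokens, vocab)

Note that the loss is computed only at the selected positions recorded in targets, regardless of which of the three modifications was applied.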
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Token Masking in BERT's MLM Strategy
Random Token Replacement in BERT's MLM Strategy
Unchanged Tokens in BERT's MLM Strategy
Critiquing a Pre-training Implementation
In a common self-supervised pre-training approach, a fraction of tokens in an input sequence is selected for the model to predict. Each of these selected tokens is then modified in one of three ways before being fed to the model. Match each modification method with its corresponding description.