Token Masking in BERT's MLM Strategy
As part of BERT's token modification strategy for Masked Language Modeling (MLM), 80% of the tokens chosen for prediction undergo token masking: the original token is replaced with the special [MASK] symbol, and the model is trained to recover it from the surrounding context.
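To make the surrounding 80/10/10 split concrete, here is a minimal Python sketch of the selection-and-corruption step. It is illustrative only, not BERT's actual implementation: mask_tokens is a hypothetical helper that operates on plain string lists rather than a real tokenizer's IDs, and the vocab argument stands in for whatever vocabulary a real implementation would sample random replacement tokens from.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Corrupt a token sequence with BERT-style MLM noise.

    Each token is independently selected for prediction with probability
    select_prob (15% in the standard BERT setup). For a selected token:
      - 80% of the time it is replaced with [MASK] (token masking),
      - 10% of the time it is replaced with a random vocabulary token,
      - 10% of the time it is left unchanged.
    Returns the corrupted sequence and the indices the model must predict.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    target_indices = []
    for i in range(len(tokens)):
        if rng.random() < select_prob:
            target_indices.append(i)
            r = rng.random()
            if r < 0.8:
                # Token masking: the case this card covers.
                corrupted[i] = MASK_TOKEN
            elif r < 0.9:
                # Random token replacement.
                corrupted[i] = rng.choice(vocab)
            # else: token is left unchanged but remains a prediction target.
    return corrupted, target_indices
```

Because selection is probabilistic, the number of [MASK] replacements varies from sequence to sequence; this is exactly the point probed by the "Verifying a Language Model's Pre-training Data" question below.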
Related
Random Token Replacement in BERT's MLM Strategy
Unchanged Tokens in BERT's MLM Strategy
When pre-training a language model, a common technique is to select a subset of tokens in an input sequence and train the model to predict them. A simple approach would be to replace every selected token with a special [MASK] symbol. However, a more sophisticated strategy is often used where, for the selected tokens, some are replaced with [MASK], some are replaced with a random token, and some are left unchanged. What is the primary analytical reason for adopting this more complex, multi-faceted strategy over simply masking 100% of the selected tokens?
Critiquing a Pre-training Implementation
In a common self-supervised pre-training approach, a fraction of tokens in an input sequence is selected for the model to predict. Each of these selected tokens is then modified in one of three ways before being fed to the model. Match each modification method with its corresponding description.
Learn After
Example of Token Masking in a BERT Input Sequence
During a language model's pre-training, a specific strategy is used to alter words that have been chosen for the model to predict. If 10,000 words in a dataset have been chosen for this prediction task, and the strategy dictates that 80% of these chosen words are replaced with a special placeholder symbol, approximately how many of the 10,000 chosen words will be replaced by this symbol?
Verifying a Language Model's Pre-training Data
Consider a standard pre-training procedure for a language model where 15% of all tokens in an input are first selected for prediction. Of these selected tokens, 80% are then replaced with a special [MASK] symbol. Based on this procedure, it is guaranteed that for any given input sequence of 1,000 tokens, exactly 120 tokens will be replaced with the [MASK] symbol.