BERT's Masked Language Model Pre-training Process

BERT's masked language model (MLM) objective is trained with a specific data-corruption process. First, 15% of the tokens in an input sequence are randomly selected as prediction targets. These selected tokens are then modified according to a fixed distribution: 80% are replaced with the special [MASK] token, 10% are replaced with a random token from the vocabulary, and the remaining 10% are left unchanged. This produces a "noisy" version of the input. The Transformer encoder processes the corrupted sequence, and the model's objective is to predict the original, unmodified tokens from the output hidden states at the selected positions.
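
A minimal sketch of this corruption step in plain Python. The token IDs, MASK_ID, PAD_ID, and VOCAB_SIZE here are hypothetical placeholders; a real BERT setup takes these from its WordPiece tokenizer and would also exclude [CLS]/[SEP] positions from selection, which is omitted here for brevity.

```python
import random

# Hypothetical special-token IDs; a real BERT vocabulary defines its own.
MASK_ID = 103      # [MASK]
PAD_ID = 0         # [PAD] (never selected for prediction)
VOCAB_SIZE = 30522

def corrupt_for_mlm(token_ids, mask_prob=0.15, seed=None):
    """Apply BERT-style MLM corruption to a list of token IDs.

    Returns (corrupted_ids, labels), where labels[i] holds the original
    token ID at selected positions and -100 (ignore) everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)          # -100 = position is not predicted

    for i, tok in enumerate(token_ids):
        if tok == PAD_ID:
            continue                          # never select padding
        if rng.random() >= mask_prob:
            continue                          # ~15% of tokens become targets
        labels[i] = tok                       # the model must recover this token
        r = rng.random()
        if r < 0.8:                           # 80%: replace with [MASK]
            corrupted[i] = MASK_ID
        elif r < 0.9:                         # 10%: replace with a random token
            corrupted[i] = rng.randrange(VOCAB_SIZE)
        # remaining 10%: leave the original token unchanged

    return corrupted, labels

if __name__ == "__main__":
    ids = [101, 7592, 2088, 2003, 2307, 102]  # toy example sequence
    noisy, targets = corrupt_for_mlm(ids, seed=0)
    print(noisy)
    print(targets)
```

Keeping 10% of the selected tokens unchanged and replacing another 10% with random tokens, rather than always inserting [MASK], reduces the mismatch between pre-training and fine-tuning, since the [MASK] token never appears in downstream inputs.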
