Example of Masked Language Modeling Prediction

Masked Language Modeling (MLM) trains a model to predict masked tokens by using the surrounding unmasked tokens as context. For instance, if an original sequence $x_0, x_1, x_2, x_3, x_4$ is modified by masking tokens $x_1$ and $x_3$, the input becomes $x_0, [\text{MASK}], x_2, [\text{MASK}], x_4$. The model's objective is to predict the original values of $x_1$ and $x_3$. This is achieved by conditioning each prediction on the embeddings of the entire input sequence, including both the unmasked tokens and the special [MASK] tokens. The probabilities are formally expressed as $\text{Pr}(x_1 \mid \mathbf{e}_0, \mathbf{e}_{\text{mask}}, \mathbf{e}_2, \mathbf{e}_{\text{mask}}, \mathbf{e}_4)$ and $\text{Pr}(x_3 \mid \mathbf{e}_0, \mathbf{e}_{\text{mask}}, \mathbf{e}_2, \mathbf{e}_{\text{mask}}, \mathbf{e}_4)$. The unmasked tokens ($x_0, x_2, x_4$) are not predicted; their outputs can be considered to have a probability of 1.
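The following is a minimal PyTorch sketch of this setup. It is illustrative only: the vocabulary size, embedding width, single-layer encoder, and the reserved MASK_ID are assumptions for the sketch, not a real pretrained model. It masks positions 1 and 3 of a five-token sequence and reads off the model's probability of the original tokens at exactly those positions, mirroring the two conditional probabilities above.

```python
import torch
import torch.nn as nn

# Toy MLM sketch (illustrative; sizes, MASK_ID, and the single-layer
# encoder are assumptions, not a pretrained model).
vocab_size, d_model = 100, 32
MASK_ID = 0  # hypothetical id reserved for the [MASK] token

embed = nn.Embedding(vocab_size, d_model)               # token -> e_i
encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)               # hidden -> logits

# Original sequence x_0..x_4; mask x_1 and x_3 as in the text.
x = torch.tensor([[17, 42, 8, 99, 5]])
masked = x.clone()
masked[0, [1, 3]] = MASK_ID       # input: x_0, [MASK], x_2, [MASK], x_4

# Condition on the embeddings of the whole (partially masked) sequence.
h = encoder(embed(masked))
probs = to_vocab(h).softmax(dim=-1)  # Pr(token | e_0, e_mask, e_2, e_mask, e_4)

# The MLM objective maximizes the probability of the original tokens
# at the masked positions only (here, x_1 and x_3).
for pos in (1, 3):
    p = probs[0, pos, x[0, pos]]
    print(f"Pr(x_{pos} = {x[0, pos].item()} | context) = {p.item():.4f}")
```

Since this toy model is untrained, the printed probabilities will be near-uniform; training drives them toward the original token values. Note that the loss in standard MLM training is computed only at the masked positions, which is why the unmasked tokens, as stated above, are not predicted.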

