Example of Masked Language Modeling Prediction

Masked Language Modeling (MLM) trains a model to predict masked tokens by using the surrounding unmasked tokens as context. For instance, if an original sequence $x_0, x_1, x_2, x_3, x_4$ is modified by masking tokens $x_1$ and $x_3$, the input becomes $x_0, [\text{MASK}], x_2, [\text{MASK}], x_4$. The model's objective is to predict the original values of $x_1$ and $x_3$. This is achieved by conditioning each prediction on the embeddings of the entire input sequence, including both the unmasked tokens and the special [MASK] tokens. The probabilities are formally expressed as $\text{Pr}(x_1 \mid \mathbf{e}_0, \mathbf{e}_{\text{mask}}, \mathbf{e}_2, \mathbf{e}_{\text{mask}}, \mathbf{e}_4)$ and $\text{Pr}(x_3 \mid \mathbf{e}_0, \mathbf{e}_{\text{mask}}, \mathbf{e}_2, \mathbf{e}_{\text{mask}}, \mathbf{e}_4)$. The unmasked tokens ($x_0, x_2, x_4$) are not predicted; their outputs can be considered to have a probability of 1.
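The following is a minimal PyTorch sketch of this setup. It is illustrative only: the vocabulary size, embedding width, single-layer encoder, and the reserved MASK_ID are assumptions for the sketch, not a real pretrained model. It masks positions 1 and 3 of a five-token sequence and reads off the model's probability of the original tokens at exactly those positions, mirroring the two conditional probabilities above.

```python
import torch
import torch.nn as nn

# Toy MLM sketch (illustrative; sizes, MASK_ID, and the single-layer
# encoder are assumptions, not a pretrained model).
vocab_size, d_model = 100, 32
MASK_ID = 0  # hypothetical id reserved for the [MASK] token

embed = nn.Embedding(vocab_size, d_model)               # token -> e_i
encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)               # hidden -> logits

# Original sequence x_0..x_4; mask x_1 and x_3 as in the text.
x = torch.tensor([[17, 42, 8, 99, 5]])
masked = x.clone()
masked[0, [1, 3]] = MASK_ID       # input: x_0, [MASK], x_2, [MASK], x_4

# Condition on the embeddings of the whole (partially masked) sequence.
h = encoder(embed(masked))
probs = to_vocab(h).softmax(dim=-1)  # Pr(token | e_0, e_mask, e_2, e_mask, e_4)

# The MLM objective maximizes the probability of the original tokens
# at the masked positions only (here, x_1 and x_3).
for pos in (1, 3):
    p = probs[0, pos, x[0, pos]]
    print(f"Pr(x_{pos} = {x[0, pos].item()} | context) = {p.item():.4f}")
```

Since this toy model is untrained, the printed probabilities will be near-uniform; training drives them toward the original token values. Note that the loss in standard MLM training is computed only at the masked positions, which is why the unmasked tokens, as stated above, are not predicted.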

