Example of MLM Training Objective with Multiple Masks

To illustrate the Masked Language Modeling (MLM) training objective with multiple masked tokens, consider the original sequence "the early bird catches the worm". If the tokens "early" at position 2 and "worm" at position 6 are masked, the objective is to maximize the sum of the log probabilities of correctly predicting these two tokens. Given the corrupted input $\bar{\mathbf{x}} = \text{[CLS] The } \underbrace{\text{[MASK]}}_{\bar{x}_2} \text{ bird catches the } \underbrace{\text{[MASK]}}_{\bar{x}_6}$, the objective to maximize is:

$$\mathrm{Loss} = \log \Pr(x_2 = \textit{early} \mid \bar{\mathbf{x}}) + \log \Pr(x_6 = \textit{worm} \mid \bar{\mathbf{x}})$$
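The objective above can be sketched numerically. The following is a minimal illustration, not an actual model: the two probability distributions are hypothetical stand-ins for what an encoder would output at the masked positions after reading the corrupted input.

```python
import math

# Hypothetical predicted distributions at the two masked positions,
# as if produced by an encoder over the corrupted input
# "[CLS] The [MASK] bird catches the [MASK]".
pred_pos2 = {"early": 0.7, "late": 0.2, "red": 0.1}
pred_pos6 = {"worm": 0.6, "seed": 0.3, "bug": 0.1}

# MLM objective: sum the log-probabilities of the correct tokens,
# evaluated at the masked positions only.
objective = math.log(pred_pos2["early"]) + math.log(pred_pos6["worm"])

print(objective)  # sum of two negative log-probabilities
```

Since each probability is at most 1, the objective is at most 0; training pushes the model to assign higher probability to "early" and "worm", moving the sum toward 0.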

Updated 2026-04-15

Ch.1 Pre-training - Foundations of Large Language Models
