
Example of Masked Language Modeling Loss Calculation

To illustrate the objective of maximizing log-scale probabilities in Masked Language Modeling, consider the original sequence 'The early bird catches the worm', in which two tokens are masked. The corrupted input is

$$\bar{\mathbf{x}} = \text{[CLS] The } \underbrace{\text{[MASK]}}_{\bar{x}_2} \text{ bird catches the } \underbrace{\text{[MASK]}}_{\bar{x}_6}$$

The objective is to maximize the sum of the log-probabilities of predicting the true tokens 'early' ($x_2$) and 'worm' ($x_6$) given this corrupted input. This is formally expressed as

$$\mathrm{Loss} = \log \Pr(x_2 = \textit{early} \mid \bar{\mathbf{x}}) + \log \Pr(x_6 = \textit{worm} \mid \bar{\mathbf{x}})$$

where $\bar{\mathbf{x}}$ is the corrupted input shown above.
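As a minimal sketch of this calculation (not from the source text), the Python snippet below computes the two-term objective from the model's predicted distributions at the two masked positions. The tiny vocabulary and the probability values are invented purely for illustration.

```python
import math

# Hypothetical predicted distributions over a tiny vocabulary at the two
# masked positions, given the corrupted input
#   x-bar = [CLS] The [MASK] bird catches the [MASK]
# The probability values below are made up for illustration.
pr_x2 = {"early": 0.7, "big": 0.2, "late": 0.1}   # distribution at position 2
pr_x6 = {"worm": 0.6, "prize": 0.3, "bus": 0.1}   # distribution at position 6

# The MLM objective sums the log-probabilities of the true tokens at the
# masked positions: log Pr(x2 = early | x-bar) + log Pr(x6 = worm | x-bar).
log_p_early = math.log(pr_x2["early"])
log_p_worm = math.log(pr_x6["worm"])
objective = log_p_early + log_p_worm

print(f"log Pr(x2 = early | x-bar) = {log_p_early:.4f}")  # -0.3567
print(f"log Pr(x6 = worm  | x-bar) = {log_p_worm:.4f}")   # -0.5108
print(f"objective to maximize      = {objective:.4f}")    # -0.8675
```

In practice this objective is implemented as a cross-entropy loss computed only over the masked positions, so maximizing the sum of log-probabilities is equivalent to minimizing its negative.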



