Concept

Decoder Loss Computation in BERT-Style Pre-training

During BERT-style pre-training of an encoder-decoder model, the decoder's task is to predict the tokens that were masked out in the corrupted input sequence. Its loss calculation, however, differs from standard sequence generation: the decoder computes the training loss only at the masked positions. The remaining, uncorrupted tokens in the target sequence do not contribute to the loss and can simply be treated as [MASK] tokens.
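As a minimal illustration (not code from the course itself), the PyTorch sketch below shows one common way to implement this selective loss: target labels at unmasked positions are set to the ignore_index, so only masked positions contribute to the cross-entropy. The shapes, random logits, and mask pattern here are stand-ins for a real decoder's outputs.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: batch of 2 sequences, length 6, vocabulary of 100.
vocab_size = 100
logits = torch.randn(2, 6, vocab_size)          # stand-in for decoder output scores
targets = torch.randint(0, vocab_size, (2, 6))  # original (uncorrupted) token ids
mask = torch.tensor([[0, 1, 0, 0, 1, 0],
                     [1, 0, 0, 1, 0, 0]],
                    dtype=torch.bool)           # True where a token was masked

# Unmasked positions are excluded from the loss by setting their label
# to the ignore_index (-100 is the PyTorch convention).
labels = targets.masked_fill(~mask, -100)

loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # (batch * seq_len, vocab)
    labels.view(-1),              # (batch * seq_len,)
    ignore_index=-100,            # unmasked positions contribute nothing
)
print(loss)
```

Because the ignored positions carry no gradient, the decoder is optimized purely on how well it recovers the masked tokens.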


Updated 2026-04-16


Tags

Foundations of Large Language Models

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences