Concept
Decoder Loss Computation in BERT-Style Pre-training
During BERT-style pre-training of an encoder-decoder model, the decoder's task is to predict the tokens that were masked out of the corrupted input sequence. Its loss calculation, however, differs from standard sequence generation: the training loss is computed only at the masked positions. The remaining, uncorrupted tokens in the target sequence contribute nothing to the loss, so their target positions can simply be filled with [MASK] placeholders, as the sketch below illustrates.
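The following is a minimal PyTorch sketch of this masked-position loss. All tensor names, shapes, and the use of ignore_index to skip unmasked positions are illustrative assumptions, not details taken from the course.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (assumptions, not from the course material).
vocab_size, batch, seq_len = 8, 2, 5

# decoder_logits: the decoder's prediction scores at every target position.
decoder_logits = torch.randn(batch, seq_len, vocab_size)

# target_ids: the original, uncorrupted token ids.
target_ids = torch.randint(0, vocab_size, (batch, seq_len))

# mask_positions: True where the input token was corrupted ([MASK]ed).
mask_positions = torch.tensor([[False, True, False, True, False],
                               [True, False, False, False, True]])

# Unmasked positions are flagged with -100, PyTorch's ignore_index,
# so they accumulate no loss and contribute no gradient.
labels = target_ids.masked_fill(~mask_positions, -100)

# Cross-entropy is averaged over the masked positions only.
loss = F.cross_entropy(decoder_logits.view(-1, vocab_size),
                       labels.view(-1),
                       ignore_index=-100)
print(loss)
```

In this sketch the labels at uncorrupted positions are overwritten before the loss call, which is the practical counterpart of "treating them as [MASK]": whatever the decoder outputs there is simply never scored.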
Updated 2026-04-16
Tags
Foundations of Large Language Models
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences