Comparison of Decoder Objectives in Encoder-Decoder Pre-training

When pre-training an encoder-decoder model, both BERT-style and denoising autoencoding methods feed the encoder a corrupted token sequence in which some tokens are replaced with [MASK] (or [M]). Their decoder objectives differ, however. In BERT-style training, the loss is computed only at the masked positions; the decoder's predictions at all other positions are ignored. In contrast, denoising autoencoding requires the decoder to autoregressively reconstruct the entire original token sequence, accumulating the loss over all tokens, just as in standard language modeling.
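The difference between the two objectives can be sketched as a difference in which positions contribute to the loss. The following is a minimal, illustrative sketch (the sequences, the [M] token, and the fixed per-token probability are hypothetical, not from the text):

```python
import math

# Hypothetical toy sequences (illustrative only).
original  = ["the", "cat", "sat", "on", "the", "mat"]
corrupted = ["the", "[M]", "sat", "on", "[M]", "mat"]  # encoder input

# Positions whose tokens were replaced with [M].
masked_positions = [i for i, tok in enumerate(corrupted) if tok == "[M]"]

def toy_cross_entropy(prob_correct):
    """Negative log-likelihood of the correct token."""
    return -math.log(prob_correct)

# Pretend the decoder assigns probability 0.8 to every correct token.
p = 0.8

# BERT-style objective: accumulate the loss only over masked positions.
bert_loss = sum(toy_cross_entropy(p) for _ in masked_positions)

# Denoising autoencoding: the decoder autoregressively predicts every
# token of the original sequence, so every position contributes.
dae_loss = sum(toy_cross_entropy(p) for _ in original)

print(f"BERT-style loss over {len(masked_positions)} masked tokens: {bert_loss:.4f}")
print(f"Denoising loss over {len(original)} tokens: {dae_loss:.4f}")
```

With the same per-token probability, the denoising loss sums over more positions, which reflects that the decoder is supervised on the full sequence rather than only the masked tokens.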

Updated 2026-04-16
