Concept

Decoder Loss Computation in BERT-Style Pre-training

During BERT-style pre-training of an encoder-decoder model, the decoder's task is to predict the tokens that were masked out in the corrupted input sequence. Its loss calculation, however, differs from standard sequence generation: the decoder computes the training loss only at the masked positions. The remaining, uncorrupted tokens in the target sequence do not contribute to the loss and can simply be treated as [MASK] tokens.
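As a minimal illustration (not code from the course itself), the PyTorch sketch below shows one common way to implement this selective loss: target labels at unmasked positions are set to the ignore_index, so only masked positions contribute to the cross-entropy. The shapes, random logits, and mask pattern here are stand-ins for a real decoder's outputs.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: batch of 2 sequences, length 6, vocabulary of 100.
vocab_size = 100
logits = torch.randn(2, 6, vocab_size)          # stand-in for decoder output scores
targets = torch.randint(0, vocab_size, (2, 6))  # original (uncorrupted) token ids
mask = torch.tensor([[0, 1, 0, 0, 1, 0],
                     [1, 0, 0, 1, 0, 0]],
                    dtype=torch.bool)           # True where a token was masked

# Unmasked positions are excluded from the loss by setting their label
# to the ignore_index (-100 is the PyTorch convention).
labels = targets.masked_fill(~mask, -100)

loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # (batch * seq_len, vocab)
    labels.view(-1),              # (batch * seq_len,)
    ignore_index=-100,            # unmasked positions contribute nothing
)
print(loss)
```

Because the ignored positions carry no gradient, the decoder is optimized purely on how well it recovers the masked tokens.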


Updated 2026-04-16


Tags

Foundations of Large Language Models

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences