As an alternative to teacher forcing, sequence-to-sequence decoders can be trained by feeding the model's own predicted token from the previous time step as the current input. This approach aligns the training process more closely with how the model generates sequences autoregressively during inference.
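The feedback loop described above can be sketched in a few lines. This is a minimal illustration rather than a real training routine: `free_running_decode` and `toy_step` are hypothetical names, and the lookup table stands in for a trained decoder.

```python
def free_running_decode(step_fn, bos, eos, max_len):
    """Generate tokens by feeding each prediction back in as the next input."""
    tokens = [bos]
    for _ in range(max_len):
        next_token = step_fn(tokens)   # prediction conditioned on the model's own history
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens[1:]                  # drop the leading <bos>


def toy_step(prefix):
    """Hypothetical stand-in for a decoder: maps the last token to the next one."""
    table = {"<bos>": "a", "a": "b", "b": "<eos>"}
    return table[prefix[-1]]


generated = free_running_decode(toy_step, "<bos>", "<eos>", max_len=10)
```

In actual training, a loss would be computed against the gold token at every step while the predicted token, not the gold one, is passed forward.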

Claude

Teacher forcing is a common training strategy for sequence models in which the ground-truth token from the previous time step is used as input, rather than the model's own prediction. In an encoder–decoder architecture, this means feeding the original target sequence into the decoder. Specifically, the decoder input is the target sequence with its final token dropped and a special beginning-of-sequence token (e.g., `<bos>`) prepended. The decoder is then trained to predict the original target sequence, which is the input shifted by one time step and ends with an end-of-sequence token (e.g., `<eos>`). This shift-by-one setup is self-supervised and closely resembles standard language model training.
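The shift can be made concrete with a short sketch. It assumes the target sequence already ends with `<eos>`; the function name `make_teacher_forcing_pair` and the example tokens are illustrative, not taken from the source.

```python
BOS, EOS = "<bos>", "<eos>"


def make_teacher_forcing_pair(target_tokens):
    """Build (decoder_input, labels) from a target sequence that ends with EOS."""
    decoder_input = [BOS] + target_tokens[:-1]  # prepend <bos>, drop the final token
    labels = list(target_tokens)                # original target, ending with <eos>
    return decoder_input, labels


# Illustrative target sequence (assumed, not from the source).
target = ["ils", "regardent", ".", EOS]
decoder_input, labels = make_teacher_forcing_pair(target)
```

At each position `t`, the decoder reads `decoder_input[t]` and is trained to predict `labels[t]`, so the two lists always have the same length.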

Teacher Forcing

Dive into Deep Learning

Predicted Token Feedback in Decoder Training

In Listen, Attend, and Spell (LAS) training, teacher forcing is typically used so that the decoder history consists of the correct gold target $$y_i$$ instead of the model's prediction $$\hat{y}_i$$. Alternatively, gold and predicted tokens can be mixed, for example by feeding the gold token with 90% probability and the decoder's prediction with 10% probability.
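One way to sketch the mixing step is a per-position coin flip between the gold token and the decoder's prediction. The function name `mix_decoder_history` and its parameters are assumptions made for illustration. A real decoder would produce predictions step by step, each conditioned on the mixed history, whereas this sketch takes them as a given list.

```python
import random


def mix_decoder_history(gold_tokens, predicted_tokens, gold_prob=0.9, rng=None):
    """Pick the gold token with probability gold_prob, else the model's prediction."""
    if rng is None:
        rng = random.Random()
    return [
        gold if rng.random() < gold_prob else pred
        for gold, pred in zip(gold_tokens, predicted_tokens)
    ]


gold = ["the", "cat", "sat"]
pred = ["the", "cap", "sat"]  # hypothetical decoder outputs
history = mix_decoder_history(gold, pred, gold_prob=0.9, rng=random.Random(0))
```

Setting `gold_prob=1.0` recovers pure teacher forcing, while `gold_prob=0.0` feeds back only the model's own predictions.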
