Training Efficiency in Denoising Autoencoding
In denoising autoencoding, corrupted text is represented with placeholder slots standing in for the masked or deleted tokens, and the model is trained to restore the original tokens by leveraging the surrounding context. A key benefit of this method is computational efficiency: when a single placeholder replaces an entire multi-token span (as in T5-style span corruption), the encoder's input sequence becomes shorter than the original text. Because the cost of self-attention grows quadratically with sequence length, these shorter inputs directly reduce the computation required per training step.
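To make the efficiency argument concrete, here is a minimal Python sketch. Everything in it is illustrative rather than taken from the original note: the word-level (tokenizer-free) treatment, the [MASK]/[SPAN] sentinel strings, and the mask_tokens/mask_spans helper names are all assumptions. It contrasts per-token masking, which preserves sequence length, with span masking, which shortens it.

```python
import random

def mask_tokens(tokens, rate=0.15, mask="[MASK]"):
    """Per-token masking: each chosen word gets its own [MASK],
    so the corrupted sequence is exactly as long as the original."""
    k = max(1, round(rate * len(tokens)))
    hidden = set(random.sample(range(len(tokens)), k))
    return [mask if i in hidden else t for i, t in enumerate(tokens)]

def mask_spans(tokens, num_spans=2, span_len=3, sentinel="[SPAN]"):
    """Span masking: each contiguous span collapses into one sentinel,
    shrinking the input by num_spans * (span_len - 1) tokens. The
    hidden spans are returned as the reconstruction targets."""
    region = len(tokens) // num_spans   # one span per region -> no overlap
    out, targets, i = [], [], 0
    for k in range(num_spans):
        start = random.randrange(k * region, k * region + region - span_len + 1)
        out.extend(tokens[i:start])     # keep the uncorrupted stretch
        out.append(sentinel)            # one placeholder for the whole span
        targets.append(tokens[start:start + span_len])
        i = start + span_len
    out.extend(tokens[i:])
    return out, targets

doc = ("the scientist carefully poured the pale blue solution "
       "into the beaker and recorded the temperature").split()
corrupted, spans = mask_spans(doc)
print(len(doc), len(mask_tokens(doc)), len(corrupted))  # 15 15 11
```

Here the span-masked input is four tokens shorter than the original; the gap grows with the span length, which is where the per-step compute savings come from.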
Related
An encoder-decoder model is being trained using the following example:
- Input to Encoder: "The scientist carefully [MASK] the solution into the beaker."
- Target Output for Decoder: "The scientist carefully poured the solution into the beaker."
Based on this training setup, what is the primary function of the decoder?
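As a hedged sketch of what one such training step might look like in code, the snippet below uses Hugging Face's T5 purely as a convenient pre-trained encoder-decoder; the choice of t5-small is an illustrative assumption, and T5's <extra_id_0> sentinel plays the role of the [MASK] placeholder in the example above. The decoder is trained with teacher forcing to emit the full original sentence token by token, conditioned via cross-attention on the encoder's reading of the corrupted input.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")    # illustrative choice
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Corrupted encoder input; <extra_id_0> stands in for [MASK].
src = "The scientist carefully <extra_id_0> the solution into the beaker."
# Full original sentence: the decoder's generation target.
tgt = "The scientist carefully poured the solution into the beaker."

enc = tok(src, return_tensors="pt")
labels = tok(tgt, return_tensors="pt").input_ids

# Teacher-forced forward pass: the decoder predicts each target token
# from the previous target tokens plus cross-attention over the
# encoder's output; the built-in loss is cross-entropy over the target.
loss = model(**enc, labels=labels).loss
loss.backward()   # gradients for one denoising training step
```

In this setup the decoder's job is generation: it reconstructs the complete original sentence autoregressively, recovering the masked word from the context the encoder supplies.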
Evaluating a Model Training Objective
An encoder-decoder model is being trained with the objective of reconstructing a full, original sentence from an input version where several random words have been removed. What is the most critical function of the encoder's output in this specific training paradigm?
Learn After
A machine learning engineer is training a model to reconstruct a document from a corrupted version. They are considering two different strategies for creating the corrupted input:
- Strategy A: Replace 15% of the words in the document, chosen at random, each with a single [MASK] token.
- Strategy B: Replace three separate, contiguous spans of words (which together make up 15% of the document's total words) with a single [SPAN] token for each span.
Assuming all other factors are equal, which strategy is likely to result in a more computationally efficient training process, and why?
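One way to check the intuition behind this question empirically is to compare the encoder input lengths the two strategies produce, reusing the hypothetical mask_tokens and mask_spans helpers sketched earlier (so this snippet assumes those definitions are in scope):

```python
doc = ("word " * 400).split()                # stand-in 400-token document
strategy_a = mask_tokens(doc, rate=0.15)     # A: every masked word keeps a slot
span_len = round(0.15 * len(doc)) // 3       # three spans covering 15% in total
strategy_b, _ = mask_spans(doc, num_spans=3, span_len=span_len)
print(len(strategy_a), len(strategy_b))      # 400 343
```

Strategy B hands the encoder a noticeably shorter sequence, and since self-attention cost scales quadratically with length, it is the more computationally efficient choice per training step, all else being equal.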