Learn Before
Replaced Token Detection as a Self-Supervised Task
This self-supervised task, exemplified by the ELECTRA model, uses a two-part setup: a generator and a discriminator. The generator, a small masked language model, first corrupts an input sequence by replacing some tokens with plausible alternatives. The discriminator, the main Transformer encoder being trained, then processes the corrupted sequence. Its objective is per-token binary classification: deciding whether each token comes from the original input or is a replacement produced by the generator. Because every token, not just the small masked subset used in standard masked language modeling, contributes a classification signal, pre-training is more sample-efficient.
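The corruption step and the per-token labels it produces can be sketched in a few lines of Python. This is a toy stand-in, assuming random sampling in place of a real generator (which is a small masked language model); all names are illustrative:

```python
import random

def corrupt(tokens, vocab, rng, replace_prob=0.3):
    """Toy generator step: replace some tokens, and record the
    per-token labels the discriminator is trained to predict
    (0 = original token, 1 = replaced token)."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Stand-in for the generator's masked-LM sample.
            corrupted.append(rng.choice([w for w in vocab if w != tok]))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

rng = random.Random(0)
tokens = ["the", "chef", "cooked", "the", "meal"]
vocab = ["the", "chef", "cooked", "ate", "a", "meal"]
corrupted, labels = corrupt(tokens, vocab, rng)
# Every position yields a supervised label, replaced or not.
assert len(labels) == len(tokens)
```

Note that the discriminator's training signal exists at every position, in contrast to objectives that only supervise the corrupted positions.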

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Replaced Token Detection as a Self-Supervised Task
Imagine two language models are being trained on the same large text corpus. Model A's task is to read an entire sentence and predict a single label for it (e.g., 'positive sentiment' or 'negative sentiment'). Model B's task is to read the same sentence, but for every individual word, it must predict whether that word has been artificially replaced with a different, plausible-sounding word. Which statement best analyzes the fundamental difference in the learning signals these two models receive?
Choosing a Training Objective for Error Detection
Evaluating Language Model Training Objectives
Learn After
The Generator in Replaced Token Detection
The Discriminator in Replaced Token Detection
Joint Training in Replaced Token Detection
Model Usage After Replaced Token Detection Training
Consider a pre-training method for a language model that uses two components. The first component, a 'generator', takes an original sentence and replaces a few words with other plausible words. The second component, a 'discriminator', then reads this modified sentence. The discriminator's task is to examine every single word in the modified sentence and decide for each one: 'Is this word from the original sentence, or is it a replacement?' What is the primary advantage of training the discriminator on this per-word classification task compared to a task where it only has to predict the original identity of the few words that were replaced?
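The difference in signal density that this question points at can be made concrete with a quick count. A minimal sketch, assuming the conventional 15% corruption fraction of BERT-style objectives:

```python
def supervised_positions(seq_len, corrupt_frac=0.15):
    """Positions contributing to the loss per sequence:
    predicting only the few replaced/masked words supervises about
    corrupt_frac of positions; per-token original-vs-replaced
    classification supervises every position."""
    return int(seq_len * corrupt_frac), seq_len

sparse, dense = supervised_positions(512)
print(sparse, dense)  # 76 positions vs all 512
```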
Analyzing a Language Model's Training Step
A language model is being pre-trained using a method where it learns to distinguish original words from plausible replacements. Arrange the following steps of a single training iteration into the correct chronological order.
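For reference, one such iteration proceeds in the order sketched below. The wording of the steps is illustrative of ELECTRA-style training, not the exact options in the exercise:

```python
# Chronological order of a single replaced-token-detection iteration.
STEPS = [
    "sample an original sequence from the corpus",
    "mask a subset of the sequence's tokens",
    "generator (a small masked LM) samples fill-ins for the masked positions",
    "assemble the corrupted sequence from the generator's samples",
    "label each position as original or replaced",
    "discriminator classifies every token of the corrupted sequence",
    "compute both losses and update generator and discriminator jointly",
]
for i, step in enumerate(STEPS, 1):
    print(f"{i}. {step}")
```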