Joint Training in Replaced Token Detection
In the Replaced Token Detection framework, the generator and discriminator are trained simultaneously. The generator is trained as a masked language model, using maximum likelihood estimation to predict the original tokens at the masked positions. Concurrently, the discriminator is trained as a per-token binary classifier, minimizing a classification loss that identifies which tokens in the sequence were replaced by the generator. In models like ELECTRA, the two losses are summed into a single weighted objective to drive this joint training.
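The combined objective can be sketched with toy numbers. This is a minimal illustration, not ELECTRA's actual implementation: the function names and the example probabilities are hypothetical, though the loss forms (MLM negative log-likelihood plus a λ-weighted per-token binary cross-entropy, with λ = 50 in the ELECTRA paper) follow the framework described above.

```python
import math

def mlm_loss(probs_of_original_tokens):
    """Generator loss: average negative log-likelihood of the
    original tokens at the masked positions (maximum likelihood)."""
    return -sum(math.log(p) for p in probs_of_original_tokens) / len(probs_of_original_tokens)

def rtd_loss(pred_replaced_probs, is_replaced):
    """Discriminator loss: binary cross-entropy over EVERY token,
    classifying each as original (0) or replaced (1)."""
    eps = 1e-9
    total = 0.0
    for p, y in zip(pred_replaced_probs, is_replaced):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(is_replaced)

# Toy values: the generator predicts the original token at 3 masked
# positions; the discriminator scores all 6 tokens of the corrupted
# sentence, two of which were replaced.
l_gen = mlm_loss([0.7, 0.5, 0.9])
l_disc = rtd_loss([0.1, 0.8, 0.2, 0.9, 0.05, 0.1], [0, 1, 0, 1, 0, 0])

lam = 50.0  # ELECTRA weights the discriminator loss heavily (λ = 50)
joint_loss = l_gen + lam * l_disc
```

Both networks are updated by minimizing `joint_loss`; note that, unlike a GAN, the generator is not trained to fool the discriminator — it simply optimizes its own maximum-likelihood term.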
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
The Generator in Replaced Token Detection
The Discriminator in Replaced Token Detection
Joint Training in Replaced Token Detection
Model Usage After Replaced Token Detection Training
Consider a pre-training method for a language model that uses two components. The first component, a 'generator', takes an original sentence and replaces a few words with other plausible words. The second component, a 'discriminator', then reads this modified sentence. The discriminator's task is to examine every single word in the modified sentence and decide for each one: 'Is this word from the original sentence, or is it a replacement?' What is the primary advantage of training the discriminator on this per-word classification task compared to a task where it only has to predict the original identity of the few words that were replaced?
Analyzing a Language Model's Training Step
A language model is being pre-trained using a method where it learns to distinguish original words from plausible replacements. Arrange the following steps of a single training iteration into the correct chronological order.
Learn After
GAN-based Training for Replaced Token Detection
In a language model pre-training setup, a 'generator' network corrupts an input sentence by replacing some tokens. A separate 'discriminator' network is then tasked with identifying which tokens in the corrupted sentence are original and which are replacements. If both networks are trained simultaneously, which statement best distinguishes their respective optimization goals?
Differentiating Training Objectives in a Two-Network Model
Analysis of Joint Training Dynamics in a Two-Network Model