Per-Token Classification for Encoder Training
A method for training Transformer encoders as classifiers is to apply a distinct supervision signal to the output at each token position in a sequence. The model learns by making a classification decision for every individual token, for example deciding whether that token has been replaced. This per-token objective, exemplified by ELECTRA's replaced token detection task, contrasts with approaches that produce a single classification for an entire sequence.
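The per-token objective above can be made concrete as a loss computation. The following is a minimal sketch, not ELECTRA's actual implementation: it assumes the model has already produced a probability of "replaced" for each token (the `probs` values below are hypothetical outputs), and it averages a binary cross-entropy term over every token position, so each token contributes its own supervision signal.

```python
import math

def per_token_bce(probs, labels):
    """Binary cross-entropy averaged over every token position.

    probs  : predicted probability that each token was replaced
    labels : 1 if the token was actually replaced, 0 if original
    Every position contributes its own loss term, unlike
    sequence-level classification, which yields one term per sequence.
    """
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)

# Toy example: a 5-token sequence in which token 2 was replaced.
probs = [0.1, 0.2, 0.9, 0.05, 0.1]  # hypothetical model outputs
labels = [0, 0, 1, 0, 0]            # replaced-token targets
loss = per_token_bce(probs, labels)
```

A sequence-level classifier would instead collapse the whole sequence into a single probability and a single loss term, which is the contrast the definition draws.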
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Next Sentence Prediction (NSP)
Designing a Self-Supervised Text Classification Task
A researcher aims to pre-train a text encoder on a large corpus of unlabeled articles. They propose the following self-supervised classification task: For each training instance, a paragraph is extracted. With 50% probability, the sentences within that paragraph are randomly reordered. The model's task is to predict a binary label: 'Original Order' or 'Shuffled Order'. Which statement best evaluates the potential effectiveness of this task for its intended purpose?
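The data-generation step the researcher proposes can be sketched directly. This is a hypothetical illustration of the task as stated (the function name is invented here): with 50% probability the paragraph's sentences are shuffled, and the binary label records which case occurred. Note, as the comment points out, that a random shuffle of a short paragraph can reproduce the original order, which is one consideration when evaluating the task.

```python
import random

def make_order_instance(sentences, rng):
    """Build one training example for the proposed task.

    Returns (sentence_list, label) where label is
    1 for 'Shuffled Order' and 0 for 'Original Order'.
    Caveat: rng.shuffle can return the original ordering by
    chance, so some 'Shuffled Order' labels may be noisy.
    """
    if rng.random() < 0.5:
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        return shuffled, 1  # 'Shuffled Order'
    return sentences[:], 0  # 'Original Order'

# Example usage with a seeded generator for reproducibility:
paragraph = ["First sentence.", "Second sentence.", "Third sentence."]
instance, label = make_order_instance(paragraph, random.Random(0))
```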
A key aspect of training text encoders with self-supervision is designing a classification task that forces the model to learn a useful property of language. Match each proposed self-supervised classification task with the primary linguistic property it is designed to teach the model.
Learn After
Replaced Token Detection as a Self-Supervised Task
Imagine two language models are being trained on the same large text corpus. Model A's task is to read an entire sentence and predict a single label for it (e.g., 'positive sentiment' or 'negative sentiment'). Model B's task is to read the same sentence, but for every individual word, it must predict whether that word has been artificially replaced with a different, plausible-sounding word. Which statement best analyzes the fundamental difference in the learning signals these two models receive?
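One way to see the difference in learning signals is to count the classification decisions each model is trained on. The sketch below is a back-of-the-envelope illustration (the function and corpus sizes are assumptions, not from the source): a per-sequence objective yields one loss term per sentence, while a per-token objective yields one per token.

```python
def supervision_signals(tokens_per_sentence, num_sentences, per_token):
    """Count the classification decisions (loss terms) the model sees.

    Model A (per-sequence) receives one label per sentence;
    Model B (per-token) receives one label per token.
    """
    labels_per_sentence = tokens_per_sentence if per_token else 1
    return labels_per_sentence * num_sentences

# A hypothetical corpus of 1,000 sentences averaging 20 tokens each:
model_a_signals = supervision_signals(20, 1000, per_token=False)
model_b_signals = supervision_signals(20, 1000, per_token=True)
```

On these assumed numbers, Model B receives 20x as many supervision signals from the same text, which is the density difference the question asks about.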
Choosing a Training Objective for Error Detection
Evaluating Language Model Training Objectives