Critique of Negative Sample Generation
A common method for creating a training dataset to determine if two sentences are consecutive is to pair a sentence with a random sentence from a different document to create a 'negative' example. Evaluate a potential weakness of this approach. Specifically, what kind of subtle sentence relationships might the model fail to learn to distinguish?
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Related
A language model is being trained on a task to determine if one sentence is the direct follow-up to another. The training data consists of sentence pairs. 'Positive' pairs are two sentences that appear consecutively in a text. 'Negative' pairs are created by taking a sentence and pairing it with a random sentence from a different part of the text.
Consider the following short text:
- The rocket launched at dawn.
- It soared through the atmosphere.
- The crowd watched in awe.
- Mission control confirmed a successful liftoff.
Based on the data generation method described, which of the following is a correctly formed 'negative' training example?
Diagnosing a Flaw in Training Data Generation