Diagnosing a Flaw in Training Data Generation
A data scientist is preparing a dataset to train a model that must determine whether two sentences are consecutive. The trained model performs poorly: it often classifies two sentences from the same paragraph as consecutive even when they are not. Analyze the data generation process described in the case study below and explain the most likely reason for the model's poor performance.
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Related
A language model is being trained on a task to determine if one sentence is the direct follow-up to another. The training data consists of sentence pairs. 'Positive' pairs are two sentences that appear consecutively in a text. 'Negative' pairs are created by taking a sentence and pairing it with a random sentence from a different part of the text.
Consider the following short text:
- The rocket launched at dawn.
- It soared through the atmosphere.
- The crowd watched in awe.
- Mission control confirmed a successful liftoff.
Based on the data generation method described, which of the following is a correctly formed 'negative' training example?
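The generation method described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `make_sentence_pairs` and the label convention (1 for consecutive, 0 for not) are hypothetical, and "a different part of the text" is interpreted here as any sentence other than the first sentence itself and its direct successor.

```python
import random

def make_sentence_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, label) triples.

    label 1: sentence_b directly follows sentence_a in the text.
    label 0: sentence_b was drawn at random from elsewhere in the text.
    """
    pairs = []
    for i in range(len(sentences) - 1):
        # Positive pair: sentence i and the sentence right after it.
        pairs.append((sentences[i], sentences[i + 1], 1))

        # Negative pair (assumption): sample any sentence except sentence i
        # itself and its direct successor, so the pair is never consecutive.
        candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
        if candidates:
            j = rng.choice(candidates)
            pairs.append((sentences[i], sentences[j], 0))
    return pairs

text = [
    "The rocket launched at dawn.",
    "It soared through the atmosphere.",
    "The crowd watched in awe.",
    "Mission control confirmed a successful liftoff.",
]

# A seeded generator keeps the sampled negatives reproducible.
for a, b, label in make_sentence_pairs(text, random.Random(0)):
    print(label, "|", a, "->", b)
```

Under this sketch, every negative pair is drawn from the same short text as the positives, so the two classes share a topic and differ only in ordering.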
Critique of Negative Sample Generation