Learn Before
Analyzing Training Data Quality
An engineer is training a text-to-text model to perform sentiment classification. After many training runs, the model's accuracy on new, unseen data is very low. The engineer inspects a few samples from the training dataset and finds the following:
classify sentiment: The product was amazing and worked perfectly. → negative
classify sentiment: I was very disappointed with the quality. → positive
In the context of a supervised learning setup, identify the primary issue with these training samples. Specifically, analyze the component of each sample that is meant to serve as the correct answer or "ground-truth" and explain why this issue leads to poor model performance.
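As a minimal sketch of how such samples are structured, the snippet below splits each 'input → output' string into its input text and its target (the ground-truth component). The `split_sample` helper and the `SEPARATOR` constant are hypothetical names introduced for illustration only, and the code assumes the arrow appears exactly as ` → ` between the two parts.

```python
# Hypothetical helper for illustration; not part of any specific library.
SEPARATOR = " → "  # assumed delimiter between input and target


def split_sample(sample: str) -> tuple[str, str]:
    """Split an 'input → output' string into (input_text, target_text)."""
    # rsplit on the last separator, so an arrow inside the input text is kept
    input_text, target_text = sample.rsplit(SEPARATOR, 1)
    return input_text.strip(), target_text.strip()


samples = [
    "classify sentiment: The product was amazing and worked perfectly. → negative",
    "classify sentiment: I was very disappointed with the quality. → positive",
]

# Each pair is (what the model reads, what the model is trained to produce).
pairs = [split_sample(s) for s in samples]

for input_text, target_text in pairs:
    print(f"input:  {input_text}")
    print(f"target: {target_text}")
```

Printing the pairs side by side makes it easier to inspect the target of each sample against the text it is paired with.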
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A text-to-text model is being trained on the following data sample formatted as 'input → output':
summarize: The solar system consists of the Sun and the astronomical objects gravitationally bound to it. Of the eight planets, the four inner terrestrial planets are Mercury, Venus, Earth, and Mars, and the four outer giant planets are Jupiter, Saturn, Uranus, and Neptune. → The solar system has eight planets, divided into inner terrestrial and outer giant groups.
Which part of this sample represents the correct, or ground-truth, label that the model is expected to learn to produce?
Analyzing Training Data Quality
Impact of Incorrect Ground-Truth Labels