Critiquing a Synthetic Data Generation Method
A team is creating a synthetic dataset of 10,000 emails to train a spam classifier. They decide to use a large language model with the following single, general prompt, repeated 10,000 times: 'Write an example of an email.' After generation, they plan to manually label each email as 'spam' or 'not spam'.
Critique this data generation method. Identify the most significant data quality issue that is likely to arise and explain the underlying reason for this issue, connecting it to the nature of the model's training. Furthermore, describe the potential negative impact this issue would have on the performance of the final spam classification model.
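To make the setup concrete, the method described above can be sketched in a few lines of Python. This is only an illustrative sketch: `generate` stands in for whatever model call the team uses, and `generate_dataset` and `label_distribution` are hypothetical helper names, not part of the exercise. The second helper shows one simple way to inspect the class balance of the manually assigned labels.

```python
from collections import Counter
from typing import Callable, Dict, List

# The single, general prompt from the scenario.
PROMPT = "Write an example of an email."


def generate_dataset(generate: Callable[[str], str], n: int,
                     prompt: str = PROMPT) -> List[str]:
    """Call the model n times with the same prompt, as the team plans to do.

    `generate` is a placeholder (assumption) for any function that maps a
    prompt string to one generated email.
    """
    return [generate(prompt) for _ in range(n)]


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}
```

For example, after manual labeling, `label_distribution(labels)` returns a mapping from each label to its share of the dataset, which makes any skew toward one class visible before the classifier is trained.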
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Input Inversion for Mitigating Data Generation Bias
Analyzing Bias in Synthetic Dataset Generation
A team is using a large language model to generate a synthetic dataset for training a sentiment classifier. The goal is to classify user feedback into 'Positive', 'Negative', or 'Neutral' categories. After generating 10,000 examples using a general prompt to create feedback, they find that approximately 80% of the generated samples are 'Positive', 15% are 'Neutral', and only 5% are 'Negative'. Which statement best analyzes the primary issue with this generated dataset and its most likely consequence for the classifier?