Learn Before

Biased Predictions in LLM-based Synthetic Data Generation

Multiple Choice

A team is using a large language model to generate a synthetic dataset for training a sentiment classifier. The goal is to classify user feedback into 'Positive', 'Negative', or 'Neutral' categories. After generating 10,000 examples using a general prompt to create feedback, they find that approximately 80% of the generated samples are 'Positive', 15% are 'Neutral', and only 5% are 'Negative'. Which statement best analyzes the primary issue with this generated dataset and its most likely consequence for the classifier?
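To make the scenario concrete, the short sketch below turns the percentages stated in the question into per-class counts and a simple imbalance ratio. It is only an illustration: the names `label_counts`, `imbalance_ratio`, and `majority_baseline_accuracy` are hypothetical helpers introduced here, and the "always predict the most common label" baseline is an assumed reference point for reading the numbers, not something the team in the question is said to have run.

```python
# Label shares (in percent) and dataset size as stated in the question.
TOTAL_EXAMPLES = 10_000
SHARE_PERCENT = {"Positive": 80, "Neutral": 15, "Negative": 5}


def label_counts(total, share_percent):
    """Convert percentage shares into absolute example counts per label."""
    return {label: total * pct // 100 for label, pct in share_percent.items()}


def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest class."""
    return max(counts.values()) / min(counts.values())


def majority_baseline_accuracy(counts):
    """Accuracy of a trivial model that always predicts the most common label,
    measured on data with the same label distribution (an assumed baseline)."""
    return max(counts.values()) / sum(counts.values())


counts = label_counts(TOTAL_EXAMPLES, SHARE_PERCENT)
ratio = imbalance_ratio(counts)
baseline = majority_baseline_accuracy(counts)

print(counts)    # per-class example counts
print(ratio)     # largest class size divided by smallest class size
print(baseline)  # accuracy from always predicting the majority label
```

Under these numbers the Negative class has 500 examples against 8,000 Positive ones, so a model can reach 80% accuracy on similarly distributed data without ever predicting Negative. This is why overall accuracy alone is a weak signal on a dataset like this.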

Updated 2025-10-05
