1Cademy - Critiquing a Synthetic Data Generation Method

Learn Before

Biased Predictions in LLM-based Synthetic Data Generation

Essay

Critiquing a Synthetic Data Generation Method

A team is creating a synthetic dataset of 10,000 emails to train a spam classifier. They decide to use a large language model with the following single, general prompt, repeated 10,000 times: 'Write an example of an email.' After generation, they plan to manually label each email as 'spam' or 'not spam'.

Critique this data generation method. Identify the most significant data quality issue that is likely to arise and explain the underlying reason for this issue, connecting it to the nature of the model's training. Furthermore, describe the potential negative impact this issue would have on the performance of the final spam classification model.
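As an aid for grounding a critique in evidence, the following is a minimal sketch of two checks one could run on the labeled emails: the share of each label and the fraction of distinct texts. The function names `label_distribution` and `distinct_ratio` and the four toy emails are hypothetical, invented for illustration; they are not output from any model and are not part of the original exercise.

```python
from collections import Counter


def label_distribution(labels):
    """Return the fraction of examples carrying each label."""
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def distinct_ratio(texts):
    """Return the fraction of texts that remain unique after
    lowercasing and collapsing runs of whitespace."""
    if not texts:
        return 0.0
    normalized = {" ".join(text.lower().split()) for text in texts}
    return len(normalized) / len(texts)


# Hypothetical toy sample standing in for manually labeled generated emails.
emails = [
    ("Hi team, the meeting is moved to 3 pm.", "not spam"),
    ("Hi team,  the meeting is moved to 3 PM.", "not spam"),
    ("Dear customer, thank you for your order.", "not spam"),
    ("You won a free prize! Click here now.", "spam"),
]

texts = [text for text, _ in emails]
labels = [label for _, label in emails]

dist = label_distribution(labels)
ratio = distinct_ratio(texts)
print(dist)
print(ratio)
```

A heavily lopsided label share or a low distinct ratio on the real dataset would be the kind of evidence a critique could cite; the thresholds that count as a problem are a judgment call and are not specified here.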


Updated 2025-10-06
