Essay

Critiquing a Synthetic Data Generation Method

A team is creating a synthetic dataset of 10,000 emails to train a spam classifier. They decide to use a large language model with the following single, general prompt, repeated 10,000 times: 'Write an example of an email.' After generation, they plan to manually label each email as 'spam' or 'not spam'.
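The procedure described above can be sketched as a short loop. This is only an illustration of the scenario, not the team's actual code: `build_unlabeled_dataset` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in for whatever model call the team would really use.

```python
from typing import Callable, Dict, List

# The single, general prompt from the scenario.
PROMPT = "Write an example of an email."


def build_unlabeled_dataset(generate: Callable[[str], str], n: int = 10_000) -> List[Dict]:
    """Call the generator n times with the same prompt.

    Labels are left as None because the scenario assigns them
    manually ('spam' or 'not spam') after generation.
    """
    rows = []
    for i in range(n):
        rows.append({"id": i, "text": generate(PROMPT), "label": None})
    return rows


def fake_generate(prompt: str) -> str:
    # Hypothetical placeholder for a real language model call.
    return f"(model output for: {prompt})"


# Small run for illustration; the scenario uses n=10,000.
dataset = build_unlabeled_dataset(fake_generate, n=5)
```

Note that nothing in the loop varies between iterations: every call sends the identical prompt, and the label column is empty until the manual pass.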

Critique this data generation method. Identify the most significant data quality issue that is likely to arise and explain the underlying reason for this issue, connecting it to the nature of the model's training. Furthermore, describe the potential negative impact this issue would have on the performance of the final spam classification model.


Updated 2025-10-06


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science