Learn Before
Biased Predictions in LLM-based Synthetic Data Generation
When using large language models to synthetically generate data for tasks such as text classification, a potential issue is the emergence of biased predictions. This bias typically manifests as an imbalance in the generated samples, where the majority of instances fall into a single category, producing a skewed dataset.
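As a quick sanity check, the label distribution of a generated dataset can be inspected before training. The sketch below uses hypothetical `(text, label)` pairs to illustrate the kind of skew described above; the data and function names are illustrative, not from the original material.

```python
from collections import Counter

def label_distribution(samples):
    """Return each label's share of the generated dataset."""
    counts = Counter(label for _, label in samples)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical generated (text, label) pairs exhibiting a skew
# toward the 'positive' class.
generated = (
    [("great product", "positive")] * 80
    + [("it was okay", "neutral")] * 15
    + [("terrible", "negative")] * 5
)

dist = label_distribution(generated)
print(dist)  # → {'positive': 0.8, 'neutral': 0.15, 'negative': 0.05}
```

A distribution this lopsided is a signal to intervene in the generation process (e.g., by conditioning the prompt on the target label) before training a classifier on the data.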
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Diagnosing Low Diversity in a Generated Dataset
Consequences of Static Prompt Structures in Automated Data Generation
Biased Predictions in LLM-based Synthetic Data Generation
An AI development team is using a large language model to automatically generate a dataset of programming problems and their solutions. They start with a simple instruction-generation prompt like:
"Generate a new programming problem."
After generating 10,000 examples, they find that the problems are repetitive (e.g., mostly sorting lists) and the generated solutions are often suboptimal. Which of the following modifications to their process would be the most effective first step to improve both the diversity of the problems and the quality of the solutions?
Learn After
Input Inversion for Mitigating Data Generation Bias
Analyzing Bias in Synthetic Dataset Generation
A team is using a large language model to generate a synthetic dataset for training a sentiment classifier. The goal is to classify user feedback into 'Positive', 'Negative', or 'Neutral' categories. After generating 10,000 examples using a general prompt to create feedback, they find that approximately 80% of the generated samples are 'Positive', 15% are 'Neutral', and only 5% are 'Negative'. Which statement best analyzes the primary issue with this generated dataset and its most likely consequence for the classifier?
Critiquing a Synthetic Data Generation Method