Analysis of Dataset Expansion Strategies
Consider two scenarios for expanding a dataset for a language-based machine learning task:
Scenario A: A team uses a powerful, general-purpose language model. They provide it with a few high-quality examples of an input and its desired output, and then prompt the model to generate thousands of new, similar input-output pairs for training.
Scenario B: Another team starts with a set of sentences. To create more training data, they apply transformations to each sentence, such as replacing words with their synonyms or translating the sentence into another language and then back into the original (back-translation).
Analyze the relationship between these two approaches. In your response, discuss their fundamental similarities in purpose and principle, as well as their key differences in terms of the novelty and diversity of the data they produce.
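To make the two scenarios concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the exercise: the toy SYNONYMS table, the function names synonym_augment and build_few_shot_prompt, and the prompt wording are all hypothetical. No language model is called; the Scenario A function only assembles the prompt text a team might send to one.

```python
import random

# Hypothetical toy synonym table (assumption); a real pipeline would draw on
# a thesaurus or a learned model instead.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "movie": ["film"],
}


def synonym_augment(sentence, rng, p=1.0):
    """Scenario B sketch: swap each known word for a random synonym with probability p."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)  # unknown words pass through unchanged
    return " ".join(out)


def build_few_shot_prompt(examples, n_new):
    """Scenario A sketch: format seed input-output pairs into a generation prompt."""
    lines = [f"Here are {len(examples)} examples of an input and its desired output."]
    for inp, outp in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {outp}")
    lines.append(f"Generate {n_new} new, similar input-output pairs in the same format.")
    return "\n".join(lines)


rng = random.Random(0)
augmented = synonym_augment("the quick movie", rng)
prompt = build_few_shot_prompt([("I loved it", "positive")], n_new=1000)
```

In this sketch the Scenario B output keeps the word count and structure of its source sentence, so each new example stays close to an existing one. The Scenario A prompt instead asks the model for entirely new pairs, so what comes back depends on the model rather than on any single seed sentence.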
Related
Analysis of Dataset Expansion Strategies
A development team has a small, high-quality dataset for training a sentiment analysis model. To improve the model's performance without collecting more user data, they use a powerful, general-purpose language model to paraphrase each existing example, generating five new variations for every original sentence while preserving the sentiment label. This process of creating synthetic training examples is most directly analogous to which traditional machine learning practice, and why?
Evaluating a Synthetic Data Generation Strategy