Evaluating Data Collection Strategies for Instruction Pre-training
A research team is developing a language model designed to follow instructions. They are considering two primary methods for creating their pre-training dataset: 1) hiring a small team of experts to manually write 10,000 high-quality, diverse instruction-response pairs, or 2) using an existing, powerful language model to synthetically generate 1,000,000 instruction-response pairs. Briefly evaluate the trade-offs between these two approaches, focusing on the core challenge of creating an effective instruction-following dataset.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A development team is pre-training a new language model to follow a wide range of instructions. They recognize that manually creating a massive, diverse, and high-quality dataset of human-written instructions and responses is prohibitively expensive and time-consuming. As a solution, they propose using an existing powerful model to synthetically generate millions of training examples. Which statement best evaluates the most significant risk of this strategy?
Evaluating a Data Collection Strategy