Dataset Sourcing Strategy Analysis
A development team is fine-tuning a language model for a specialized legal domain. They have a fixed budget and must choose between two data creation strategies:
- Commissioning a small, highly curated dataset (approximately 1,000 examples) created by legal experts.
- Generating a much larger dataset (approximately 50,000 examples) using a combination of automated methods and review by non-expert crowd workers.
Analyze the potential risks and benefits of each strategy, focusing on the trade-offs between data quality, data quantity, and overall project cost. Conclude with a justified recommendation for which strategy the team should pursue.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
SFT Data Strategy for a FinTech Startup
A medical research institute with a limited budget is developing an instruction-following dataset to fine-tune a language model for generating patient-friendly summaries of clinical trial results. Which of the following strategies represents the most sound evaluation of the economic trade-offs involved?