Essay

Dataset Sourcing Strategy Analysis

A development team is fine-tuning a language model for a specialized legal domain. They have a fixed budget and must choose between two data creation strategies:

  1. Commissioning a small, highly curated dataset (approx. 1,000 examples) created by legal experts.
  2. Generating a much larger dataset (approx. 50,000 examples) using a combination of automated methods and review by non-expert crowd workers.

Analyze the potential risks and benefits of each strategy, focusing on the trade-offs between data quality, data quantity, and overall project cost. Conclude with a justified recommendation for which strategy the team should pursue.


Updated 2025-10-06


Tags

Ch.4 Alignment - Foundations of Large Language Models
