Essay

Why does the traditional 70/30 train/test split heuristic fail to scale efficiently for big-data machine learning problems?

Question: Explain why the traditional 70/30 train/test split heuristic is not suitable for big-data machine learning problems. In your response, contrast its application on modest datasets (100 to 10,000 examples) versus very large datasets (e.g., up to a billion examples), focusing on the fraction versus absolute size of dev/test sets and the primary purpose of these sets.

Sample answer: The 70/30 heuristic works well for modest datasets (100 to 10,000 examples) because 30% provides a reasonable number of test examples without starving the training set. However, for big data (e.g., a billion examples), allocating 30% to dev/test sets is excessively large and wasteful, as evaluating algorithm performance does not require hundreds of millions of examples. Instead, as the dataset scale grows, the fraction of data allocated to dev/test sets shrinks, even as the absolute number of examples in these sets remains sufficiently large to reliably evaluate performance.

Key points:

  • The 70/30 heuristic works well for modest datasets of 100 to 10,000 examples.
  • For big data, the fraction of data allocated to dev/test sets shrinks.
  • The absolute number of examples in dev/test sets can still grow even as their fraction shrinks.
  • Dev/test sets do not need to be excessively large beyond what is needed to evaluate algorithm performance.

Rubric: The response must explain: 1. Why 70/30 works for modest datasets (100 to 10,000 examples). 2. How the fraction of data for dev/test sets shrinks for large datasets while the absolute number grows. 3. That dev/test sets only need to be large enough to evaluate algorithm performance, and exceeding this is unnecessary.

0

1

Updated 2026-05-26

Contributors are:

Who are from:

Tags

Machine Learning

Deep Learning

Supervised Learning

Dive into Deep Learning @ D2L

Data Science

Machine Learning Strategy

Related