Why does the traditional 70/30 train/test split heuristic fail to scale efficiently for big-data machine learning problems?
Question: Explain why the traditional 70/30 train/test split heuristic is not suitable for big-data machine learning problems. In your response, contrast its application on modest datasets (100 to 10,000 examples) versus very large datasets (e.g., up to a billion examples), focusing on the fraction versus absolute size of dev/test sets and the primary purpose of these sets.
Sample answer: The 70/30 heuristic works well for modest datasets (100 to 10,000 examples) because 30% provides a reasonable number of test examples without starving the training set. However, for big data (e.g., a billion examples), allocating 30% to dev/test sets is excessively large and wasteful, as evaluating algorithm performance does not require hundreds of millions of examples. Instead, as the dataset scale grows, the fraction of data allocated to dev/test sets shrinks, even as the absolute number of examples in these sets remains sufficiently large to reliably evaluate performance.
Key points:
- The 70/30 heuristic works well for modest datasets of 100 to 10,000 examples.
- For big data, the fraction of data allocated to dev/test sets shrinks.
- The absolute number of examples in dev/test sets can still grow even as their fraction shrinks.
- Dev/test sets do not need to be excessively large beyond what is needed to evaluate algorithm performance.
Rubric: The response must explain: 1. Why 70/30 works for modest datasets (100 to 10,000 examples). 2. How the fraction of data for dev/test sets shrinks for large datasets while the absolute number grows. 3. That dev/test sets only need to be large enough to evaluate algorithm performance, and exceeding this is unnecessary.
0
1
Tags
Machine Learning
Deep Learning
Supervised Learning
Dive into Deep Learning @ D2L
Data Science
Machine Learning Strategy
Related
For which dataset size is the 70/30 train/test split heuristic most appropriate?
True or False: As ML datasets grow to billions of examples, the fraction of data allocated to dev/test sets should grow proportionally to maintain the 70/30 split.
For large-scale ML problems, the _____ of data allocated to dev/test sets has been shrinking even as the absolute number of examples grows.
For which example-count range does the 70/30 train/test split heuristic work well, according to Andrew Ng?
As total dataset size grows into the billions, the fraction of data allocated to dev/test sets also increases.
The 70/30 train/test heuristic works well when you have a _____ number of examples, such as 100 to 10,000.
Match each dataset scale or concept to the correct dev/test sizing implication from Machine Learning Yearning.
Order the reasoning steps Ng uses to argue that the 70/30 heuristic does not apply to large datasets.
In big-data ML problems, what happens to the absolute number of dev/test examples as total dataset size grows?
Dev/test sets should be no larger than what is needed to evaluate the performance of your algorithms.
One popular heuristic was to use _____ of your data for your test set, though this does not apply to large datasets.
Match each train/dev/test splitting term to its correct definition from Machine Learning Yearning.
Order the steps a practitioner should follow when deciding test set size as their dataset grows from thousands to billions of examples.
Why does the traditional 70/30 train/test split heuristic fail to scale efficiently for big-data machine learning problems?
Evaluating dev and test set sizes for a billion-scale image classification system
Determining dev/test set size in the era of big-data machine learning