1Cademy - Avoid Randomly Shuffling Mixed-Source Data into Dev/Test Sets

Learn Before

Training and Dev/Test Sets from Different Distributions

Concept

Avoid Randomly Shuffling Mixed-Source Data into Dev/Test Sets

When some of the available data comes from sources that differ from the distribution you care about, randomly shuffling everything into train/dev/test sets makes the dev/test sets mirror the pooled data instead of the target distribution. In the cat-app example, shuffling the user images together with the far larger set of internet images would make about 97.6% of the dev/test data (205,000 of the 210,000 images) internet images, so those sets would not reflect the app-user distribution.
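The arithmetic and the alternative split can be sketched in a few lines of Python. This is only an illustrative sketch, not code from Machine Learning Yearning: the function names `internet_share` and `split_target_first`, the dev/test sizes, and the toy data are all assumptions, and the 5,000 / 205,000 counts are the ones used in the questions on this page.

```python
import random


def internet_share(n_user: int, n_internet: int) -> float:
    """Expected fraction of internet images in any random slice of the pooled data."""
    return n_internet / (n_user + n_internet)


def split_target_first(user, internet, n_dev, n_test, seed=0):
    """Hypothetical helper: draw dev/test only from the target (user) source.

    Leftover user examples and all internet examples go to the training set.
    """
    if n_dev + n_test > len(user):
        raise ValueError("not enough user examples for the requested dev/test sizes")
    rng = random.Random(seed)
    user = list(user)
    rng.shuffle(user)
    dev = user[:n_dev]
    test = user[n_dev:n_dev + n_test]
    train = user[n_dev + n_test:] + list(internet)
    rng.shuffle(train)
    return train, dev, test


# Counts as used in the questions on this page.
share = internet_share(5_000, 205_000)
print(f"{share:.1%}")  # 97.6%

# Toy stand-ins for images (assumed sizes, same 1:41 ratio).
user_imgs = [("user", i) for i in range(50)]
internet_imgs = [("internet", i) for i in range(2050)]
train, dev, test = split_target_first(user_imgs, internet_imgs, n_dev=10, n_test=10)
```

With a plain random shuffle, the dev/test sets inherit the pooled ratio computed by `internet_share`. Drawing them from the user source first keeps every dev/test example on the target distribution, while the training set can still use the internet images.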

Updated 2026-06-14

References

Machine Learning Yearning (Deeplearning.ai)

Learn After

What is the main problem with randomly shuffling user and internet images together into dev/test sets for a cat app?
True or False: Randomly shuffling all available data into train/dev/test sets is recommended when your data sources have different distributions.
Dev and test sets should be chosen to reflect the _____ you expect to encounter in the future, not the overall shuffled data pool.
In the cat-app example, what percentage of dev/test data comes from internet images if all 210,000 images are randomly shuffled?
True or False: Andrew Ng recommends randomly shuffling all available data into dev/test sets even when the data sources differ from the target distribution.
In the cat-app example, randomly shuffling all data means about _____% of dev/test images would come from internet sources.
Match each dev/test set characteristic to the consequence it produces for the ML team.
Order the reasoning steps that explain why randomly shuffling mixed-source data into dev/test sets is problematic.
According to Andrew Ng, what is the primary criterion when choosing dev and test sets?
True or False: Randomly shuffling 5,000 user images with 205,000 internet images produces a dev/test set that accurately reflects the app-user distribution.
Andrew Ng recommends choosing dev and test sets to reflect data you expect to get in the _____ and want to do well on.
Match each data-partitioning scenario to the corresponding recommendation or outcome from Machine Learning Yearning.
Order the steps for correctly partitioning mixed-source data so that dev/test sets reflect the target distribution.
Analyze the consequences of shuffling mixed-source data into the dev and test sets.
Evaluate the data partitioning strategy for a cat-app using mixed-source data.
Why must dev/test sets reflect the target distribution instead of a shuffled mix?
