Analyze the consequences of shuffling mixed-source data into the dev and test sets.
Question: Based on the cat-app example, analyze the consequences of randomly shuffling all 210,000 available images (consisting of 205,000 internet images and 5,000 user images) into the train, dev, and test sets. Why does this approach fail to align with the core recommendation for choosing dev and test sets?
Sample answer: Randomly shuffling all 210,000 available images makes the train, dev, and test sets come from the same distribution. However, because internet images make up the vast majority of the pool (205,000 of 210,000), about 97.6% of the dev and test sets will consist of internet images. The dev and test sets therefore fail to reflect the app-user distribution, which is the data the team expects to get in the future and wants to do well on. As a result, the team will optimize the model for internet images rather than for actual user images.
Key points:
- Shuffling all available data forces the train, dev, and test sets to come from the same distribution.
- About 97.6% (205,000 out of 210,000) of the dev/test data would come from internet images.
- The resulting dev/test sets fail to reflect the target app-user distribution.
- It violates the recommendation to choose dev and test sets that reflect the data you expect to get in the future and want to do well on.
Rubric: The answer must explain that shuffling results in dev/test sets dominated by internet images (approx. 97.6%), which fails to reflect the target user distribution, and explain why this violates the rule to choose sets reflecting future target data.
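The 97.6% figure can be checked with a short simulation. The sketch below is illustrative only: it pools 205,000 "internet" labels and 5,000 "user" labels, shuffles them, and measures the internet share of a dev set. The dev-set size of 2,500 and the function name `internet_share_after_shuffle` are assumptions made for this example, not values from the source.

```python
import random

N_INTERNET = 205_000  # internet images (from the cat-app example)
N_USER = 5_000        # app-user images (from the cat-app example)


def internet_share_after_shuffle(dev_size, seed=0):
    """Shuffle the pooled images and return the fraction of
    internet images in a dev set of `dev_size` items.

    `dev_size` and `seed` are hypothetical parameters chosen
    for illustration.
    """
    pool = ["internet"] * N_INTERNET + ["user"] * N_USER
    rng = random.Random(seed)
    rng.shuffle(pool)
    dev = pool[:dev_size]  # first slice of the shuffled pool acts as the dev set
    return dev.count("internet") / dev_size


# Expected share under a uniform random shuffle: 205,000 / 210,000
expected_share = N_INTERNET / (N_INTERNET + N_USER)

# Share observed in one simulated shuffle (assumed dev size of 2,500)
simulated_share = internet_share_after_shuffle(dev_size=2_500)

print(f"expected internet share: {expected_share:.1%}")
print(f"simulated internet share in dev set: {simulated_share:.1%}")
```

The expected share comes out to about 97.6%, and any single shuffle should land close to it. Only about 2.4% of such a dev set would come from app users, which is why the shuffled split says little about performance on the target distribution.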
References
Machine Learning Yearning (Deeplearning.ai)
Tags
Machine Learning
Deep Learning
Supervised Learning
Dive into Deep Learning @ D2L
Data Science
Machine Learning Strategy
Related
What is the main problem with randomly shuffling user and internet images together into dev/test sets for a cat app?
True or False: Randomly shuffling all available data into train/dev/test sets is recommended when your data sources have different distributions.
Dev and test sets should be chosen to reflect the _____ you expect to encounter in the future, not the overall shuffled data pool.
In the cat-app example, what percentage of dev/test data comes from internet images if all 210,000 images are randomly shuffled?
Andrew Ng recommends randomly shuffling all available data into dev/test sets even when the data sources differ from the target distribution.
In the cat-app example, randomly shuffling all data means about _____% of dev/test images would come from internet sources.
Match each dev/test set characteristic to the consequence it produces for the ML team.
Order the reasoning steps that explain why randomly shuffling mixed-source data into dev/test sets is problematic.
According to Andrew Ng, what is the primary criterion when choosing dev and test sets?
Randomly shuffling 5,000 user images with 205,000 internet images produces a dev/test set that accurately reflects the app-user distribution.
Andrew Ng recommends choosing dev and test sets to reflect data you expect to get in the _____ and want to do well on.
Match each data-partitioning scenario to the corresponding recommendation or outcome from Machine Learning Yearning.
Order the steps for correctly partitioning mixed-source data so that dev/test sets reflect the target distribution.
Analyze the consequences of shuffling mixed-source data into the dev and test sets.
Evaluate the data partitioning strategy for a cat-app using mixed-source data.
Why must dev/test sets reflect the target distribution instead of a shuffled mix?