Evaluate the data partitioning strategy for a cat-app using mixed-source data.
Case context: You are building a cat-app and have collected 205,000 images from the internet and 5,000 images uploaded by users. A team member suggests randomly shuffling all 210,000 images together before splitting them into train, dev, and test sets to ensure that all three sets come from the exact same distribution.
Question: Diagnose the flaw in this partitioning strategy. What will be the composition of the dev/test sets, and how does this affect your team's ability to optimize for the target distribution?
Sample answer: The flaw is that shuffling the mixed-source data produces dev and test sets that mirror the pooled data rather than the target distribution. Because 205,000 of the 210,000 images come from the internet, approximately 97.6% of the dev/test images will be internet images and only about 2.4% will be user uploads. This composition does not reflect the app-user distribution that the application actually needs to perform well on. Consequently, dev/test metrics will mostly measure performance on internet images, and the team's optimization efforts will be misdirected toward internet images rather than user-uploaded images.
Key points:
- Random shuffling yields dev/test sets that are about 97.6% internet images (205,000 of 210,000).
- Internet images do not reflect the target app-user distribution.
- The team will optimize for the wrong distribution (internet images instead of user images).
Rubric: The response must identify that the dev/test sets will fail to reflect the target app-user distribution, calculate or state that approximately 97.6% of dev/test images will be internet images, and explain that optimization will focus on the wrong distribution.
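The 97.6% figure can be checked with a short simulation. The following is a minimal sketch, not part of the original case: the dev and test sizes of 10,000 images each, the seed, and all function names are hypothetical choices made only for illustration. It labels each image by source, shuffles the pool, splits it, and reports the share of internet images in each split.

```python
import random

# Counts taken from the case context.
N_INTERNET = 205_000
N_USER = 5_000


def expected_internet_fraction(n_internet, n_user):
    """Share of internet images in the pooled data."""
    return n_internet / (n_internet + n_user)


def shuffled_split(n_internet, n_user, dev_size, test_size, seed=0):
    """Shuffle source labels together and split into train/dev/test.

    dev_size, test_size, and seed are illustrative assumptions;
    the case does not specify them.
    """
    labels = ["internet"] * n_internet + ["user"] * n_user
    rng = random.Random(seed)
    rng.shuffle(labels)
    dev = labels[:dev_size]
    test = labels[dev_size:dev_size + test_size]
    train = labels[dev_size + test_size:]
    return train, dev, test


def internet_share(split):
    """Fraction of a split that came from the internet."""
    return split.count("internet") / len(split)


train, dev, test = shuffled_split(N_INTERNET, N_USER, 10_000, 10_000)
print(f"pooled: {expected_internet_fraction(N_INTERNET, N_USER):.1%}")
print(f"dev:    {internet_share(dev):.1%}")
print(f"test:   {internet_share(test):.1%}")
```

Under these assumptions, the dev and test shares should land close to the pooled share of 205,000 / 210,000, so only a few hundred of the 10,000 images in each evaluation set would be user uploads.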
References
Machine Learning Yearning (Deeplearning.ai)
Tags
Machine Learning
Deep Learning
Supervised Learning
Dive into Deep Learning @ D2L
Data Science
Machine Learning Strategy
Related
What is the main problem with randomly shuffling user and internet images together into dev/test sets for a cat app?
True or False: Randomly shuffling all available data into train/dev/test sets is recommended when your data sources have different distributions.
Dev and test sets should be chosen to reflect the _____ you expect to encounter in the future, not the overall shuffled data pool.
In the cat-app example, what percentage of dev/test data comes from internet images if all 210,000 images are randomly shuffled?
Andrew Ng recommends randomly shuffling all available data into dev/test sets even when the data sources differ from the target distribution.
In the cat-app example, randomly shuffling all data means about _____% of dev/test images would come from internet sources.
Match each dev/test set characteristic to the consequence it produces for the ML team.
Order the reasoning steps that explain why randomly shuffling mixed-source data into dev/test sets is problematic.
According to Andrew Ng, what is the primary criterion when choosing dev and test sets?
Randomly shuffling 5,000 user images with 205,000 internet images produces a dev/test set that accurately reflects the app-user distribution.
Andrew Ng recommends choosing dev and test sets to reflect data you expect to get in the _____ and want to do well on.
Match each data-partitioning scenario to the corresponding recommendation or outcome from Machine Learning Yearning.
Order the steps for correctly partitioning mixed-source data so that dev/test sets reflect the target distribution.
Analyze the consequences of shuffling mixed-source data into the dev and test sets.
Why must dev/test sets reflect the target distribution instead of a shuffled mix?