Diagnostic Ambiguity with Mismatched Dev/Test Distributions
Question: In machine learning projects where the dev and test sets are drawn from different distributions, a model might perform exceptionally well on the dev set but exhibit poor performance on the test set. Discuss the three possible explanations for this performance gap as identified in Machine Learning Yearning, and explain why this scenario leaves your options for optimization unclear.
Sample answer: When a model works well on the dev set but poorly on a test set drawn from a different distribution, Machine Learning Yearning gives three possible explanations. First, the model may have overfit to the dev set. Second, the test set may be harder than the dev set, in which case the algorithm might already be doing as well as could be expected and no significant further improvement is possible. Third, the test set may be not harder but simply different, so what works well on the dev set does not work well on the test set, and much of the effort spent improving dev-set performance may be wasted. If the two sets came from the same distribution, the diagnosis would be clear: the model overfit the dev set, and the cure is more dev data. With mismatched distributions, the gap alone cannot distinguish among the three causes, so it is unclear whether the team should reduce overfitting, accept that performance is near its ceiling, or realign the dev set with the test distribution.
Key points:
- The model overfit to the dev set.
- The test set is harder than the dev set.
- The test set is just different from the dev set.
- It is unclear which problem to fix to improve performance.
Rubric: The essay should correctly identify the three potential causes (overfitting to dev, harder test set, different test set) and explain how the inability to distinguish between them makes the path to further improvement unclear.
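The reasoning in the sample answer can be sketched as a small helper. This is only an illustrative sketch, not something taken from Machine Learning Yearning: the function name `candidate_explanations`, its parameters, and the 0.05 gap threshold are all hypothetical choices made for the example. It shows that when the dev and test sets share a distribution, a gap points to a single cause, whereas with mismatched distributions all three explanations remain plausible.

```python
def candidate_explanations(dev_error, test_error, same_distribution,
                           gap_threshold=0.05):
    """Return the explanations that remain plausible for a dev/test gap.

    All names and the default threshold are hypothetical, chosen only
    to illustrate the argument.
    """
    gap = test_error - dev_error
    if gap <= gap_threshold:
        # No meaningful gap, so there is nothing to explain.
        return []
    if same_distribution:
        # Matched distributions: the gap has one clear diagnosis.
        return ["overfit to the dev set"]
    # Mismatched distributions: the gap alone cannot separate these.
    return [
        "overfit to the dev set",
        "test set is harder than the dev set",
        "test set is different from the dev set",
    ]


if __name__ == "__main__":
    # Same error rates, different conclusions depending on the data split.
    print(candidate_explanations(0.05, 0.25, same_distribution=True))
    print(candidate_explanations(0.05, 0.25, same_distribution=False))
```

The error numbers are the same in both calls; only the knowledge about how the sets were drawn changes how many explanations survive, which is the ambiguity the question asks about.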
References
Machine Learning Yearning (Deeplearning.ai)
Tags
Machine Learning
Deep Learning
Machine Learning Strategy
Supervised Learning
Dive into Deep Learning @ D2L
Data Science
Machine Learning Yearning @ DeepLearning.AI