Troubleshooting a Performance Drop on the Test Set
Case context: Your team has built a system that achieves excellent accuracy on your dev set. However, when evaluating it on the test set, the performance drops significantly. You are aware that your dev and test sets were drawn from entirely different distributions.
Question: Based on Machine Learning Yearning, what are the three possible things that could have gone wrong in this scenario? Furthermore, in which of these three scenarios might you conclude that no further significant improvement is possible?
Sample answer: The three possible things that could have gone wrong are: 1) the model overfit to the dev set, 2) the test set is harder than the dev set, or 3) the test set is just different from the dev set. If the test set is simply harder than the dev set, you might conclude that your algorithm is already doing as well as could be expected, and no further significant improvement is possible.
Key points:
- Overfit to the dev set
- Test set is harder
- Test set is different
- No further improvement might be possible if the test set is harder.
Rubric: The response must list the three specific causes for the performance drop and correctly attribute the potential impossibility of further improvement to the scenario where the test set is inherently harder.
0
1
References
Machine Learning Yearning (Deeplearning.ai)
Machine Learning Yearning (Deeplearning.ai)
Machine Learning Yearning (Deeplearning.ai)
Machine Learning Yearning (Deeplearning.ai)
Machine Learning Yearning (Deeplearning.ai)
Machine Learning Yearning (Deeplearning.ai)
Machine Learning Yearning (Deeplearning.ai)
Machine Learning Yearning (Deeplearning.ai)
Machine Learning Yearning (Deeplearning.ai)
Tags
Machine Learning
Deep Learning
Machine Learning Strategy
Supervised Learning
Dive into Deep Learning @ D2L
Data Science
Machine Learning Yearning @ DeepLearning.AI
Related
Mismatched Dev/Test Sets Can Waste Dev-Set Optimization Effort
Which is NOT listed in Machine Learning Yearning as a possible cause when a model does well on dev but poorly on the test set (different distributions)?
True or False: When dev and test sets come from different distributions, diagnosing why a model underperforms on the test set is straightforward.
Machine Learning Yearning warns that if dev and test sets come from different _____, a gap in performance leaves the cause of failure unclear.
Match each possible failure cause (when dev and test distributions differ) to its correct description from Machine Learning Yearning.
Order the three possible failure causes as they appear in Machine Learning Yearning when a model succeeds on dev but fails on test with mismatched distributions.
According to Machine Learning Yearning, what is the key implication if the test set is harder than the dev set when the two sets have different distributions?
True or False: According to Machine Learning Yearning, a lower test-set score compared to dev always means the test set is objectively harder.
Machine Learning Yearning states: 'So what works well on the _____ set just does not work well on the test set.'
Match each failure diagnosis (mismatched dev/test distributions) to the corrective implication it would suggest for a practitioner.
Order the reasoning steps that lead a practitioner to recognize diagnostic ambiguity when dev and test sets come from different distributions.
Diagnostic Ambiguity with Mismatched Dev/Test Distributions
Troubleshooting a Performance Drop on the Test Set
Three Causes of Poor Test Performance