1Cademy - Troubleshooting a Performance Drop on the Test Set

Learn Before

Different Dev/Test Distributions Make Failure Diagnosis Ambiguous

Case Study

Troubleshooting a Performance Drop on the Test Set

Case context: Your team has built a system that achieves excellent accuracy on your dev set. However, when evaluating it on the test set, the performance drops significantly. You are aware that your dev and test sets were drawn from entirely different distributions.

Question: Based on Machine Learning Yearning, what are the three possible things that could have gone wrong in this scenario? Furthermore, in which of these three scenarios might you conclude that no further significant improvement is possible?

Sample answer: The three possible things that could have gone wrong are: 1) the model overfit to the dev set, 2) the test set is harder than the dev set, or 3) the test set is just different from the dev set. If the test set is simply harder than the dev set, you might conclude that your algorithm is already doing as well as could be expected, and no further significant improvement is possible.

Key points:

Overfit to the dev set
Test set is harder
Test set is different
No further improvement might be possible if the test set is harder.

Rubric: The response must list the three specific causes for the performance drop and correctly attribute the potential impossibility of further improvement to the scenario where the test set is inherently harder.

0

1

Updated 2026-05-27

Contributors are:

Who are from:

References

Learn Before

Related