Case Study

Troubleshooting a Performance Drop on the Test Set

Case context: Your team has built a system that achieves excellent accuracy on your dev set. However, when you evaluate it on the test set, performance drops significantly. You know that your dev and test sets were drawn from entirely different distributions.

Question: Based on Machine Learning Yearning, what are the three possible things that could have gone wrong in this scenario? Furthermore, in which of these three scenarios might you conclude that no further significant improvement is possible?

Sample answer: The three things that could have gone wrong are: 1) the model has overfit to the dev set, 2) the test set is harder than the dev set, or 3) the test set is not necessarily harder but just different from the dev set, so what works well on the dev set does not work well on the test set. In the second scenario, where the test set is simply harder, you might conclude that your algorithm is already doing as well as could be expected and that no further significant improvement is possible.
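The three causes in the sample answer can be told apart only with extra measurements, so a rough sketch of one possible triage may help. The function below is a hypothetical illustration, not a procedure from Machine Learning Yearning: the name `diagnose_drop`, the inputs `fresh_dev_err` (error on new, untouched data from the dev distribution) and `test_floor_err` (an estimate of the best achievable test error, such as human-level error), and the `tol` threshold are all assumptions introduced here.

```python
def diagnose_drop(dev_err, fresh_dev_err, test_err, test_floor_err, tol=0.02):
    """Guess which of the three causes best explains a dev-to-test drop.

    All inputs are error rates in [0, 1]. Every name here is hypothetical:
      dev_err         error on the dev set used for tuning
      fresh_dev_err   error on untouched data from the dev distribution
      test_err        error on the test set
      test_floor_err  estimated best achievable test error (e.g. human-level)
      tol             gap treated as noise (an arbitrary choice)
    """
    # The model does worse even on new data from the dev distribution:
    # the dev-set score was optimistic, which points to overfitting.
    if fresh_dev_err - dev_err > tol:
        return "overfit to the dev set"
    # Test error is already close to the best achievable on that set:
    # the test set is simply harder, and little headroom remains.
    if test_err - test_floor_err <= tol:
        return "test set is harder"
    # Headroom remains on the test set but dev-set gains did not transfer.
    return "test set is different"
```

For example, `diagnose_drop(0.02, 0.03, 0.12, 0.11)` falls into the "harder" branch, the only case in which concluding that no further significant improvement is possible would be reasonable.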

Key points:

  • The model has overfit to the dev set
  • The test set is harder than the dev set
  • The test set is not harder, just different from the dev set
  • If the test set is harder, no further significant improvement may be possible

Rubric: The response must list the three specific causes for the performance drop and correctly attribute the potential impossibility of further improvement to the scenario where the test set is inherently harder.

Updated 2026-05-27

Tags

Machine Learning

Deep Learning

Machine Learning Strategy

Supervised Learning

Dive into Deep Learning @ D2L

Data Science

Machine Learning Yearning @ DeepLearning.AI
