1Cademy - Diagnosing a drop in test set performance with mismatched distributions.

Learn Before

Choosing Dev and Test Sets from the Same Distribution When Possible

Case Study

Case context: A team builds a speech recognition system. They tune their model and achieve excellent performance on their development (dev) set. However, when they evaluate the final model on their test set, they observe a significant drop in accuracy. Checking their data sources, they find that the dev set and test set were drawn from different distributions.

Question: Given that the dev and test sets come from different distributions, explain why the team cannot make a definitive diagnosis about why the model failed on the test set, and list the possible explanations they must consider.

Sample answer: Because the dev and test sets come from different distributions, the cause of the poor test set performance is ambiguous. Had both sets come from the same distribution, the drop would point clearly to overfitting the dev set. With mismatched distributions, the team cannot isolate a single cause and must consider three possibilities: they may have overfit the dev set, the test set may simply be harder than the dev set, or their algorithm may be performing as well as could be expected.

Key points:

Different dev and test distributions make diagnosing poor test performance ambiguous.
The team may have overfit the model to the dev set.
The test set might be harder than the dev set.
The algorithm might be doing as well as could be expected.

Rubric: The answer should explain that different distributions make the diagnosis ambiguous or unclear. It must list the three possible explanations grounded in the source text: 1) Overfitting the dev set, 2) The test set being harder than the dev set, and 3) The algorithm doing as well as could be expected.
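The reasoning in this case study can be sketched as a small decision helper. This is only an illustration: the function name candidate_explanations, the tolerance threshold, and the accuracy figures are hypothetical and do not come from the case itself. The same-distribution branch assumes that a drop between matched dev and test sets points to overfitting the dev set.

```python
from typing import List


def candidate_explanations(
    dev_accuracy: float,
    test_accuracy: float,
    same_distribution: bool,
    tolerance: float = 0.02,  # hypothetical cutoff for a "significant" drop
) -> List[str]:
    """List the explanations a team must still consider for a dev-to-test drop."""
    gap = dev_accuracy - test_accuracy
    if gap <= tolerance:
        # No meaningful drop, so there is nothing to diagnose.
        return []
    if same_distribution:
        # Matched distributions: the drop points to one cause.
        return ["overfit the dev set"]
    # Mismatched distributions: the accuracy gap alone cannot separate these.
    return [
        "overfit the dev set",
        "test set is harder than the dev set",
        "algorithm is doing as well as could be expected",
    ]


if __name__ == "__main__":
    # Illustrative numbers only: same gap, different diagnostic clarity.
    print(candidate_explanations(0.95, 0.80, same_distribution=True))
    print(candidate_explanations(0.95, 0.80, same_distribution=False))
```

The same accuracy gap yields a single explanation when the distributions match and three when they do not, which is the ambiguity the question asks about.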

Updated 2026-06-17
