Essay

Diagnostic Ambiguity with Mismatched Dev/Test Distributions

Question: In machine learning projects where the dev and test sets are drawn from different distributions, a model might perform exceptionally well on the dev set but exhibit poor performance on the test set. Discuss the three possible explanations for this performance gap as identified in Machine Learning Yearning, and explain why this scenario leaves your options for optimization unclear.

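Sample answer: When a model does well on the dev set but poorly on a test set drawn from a different distribution, Machine Learning Yearning lists three things that could have gone wrong. First, the model may have overfit to the dev set. Second, the test set may be harder than the dev set, in which case the algorithm may already be doing as well as can be expected and no significant further improvement is possible. Third, the test set may be no harder but simply different, so what works well on the dev set does not carry over to the test set, and much of the effort spent improving dev set performance may be wasted. If the dev and test sets came from the same distribution, the diagnosis would be clear: the model overfit the dev set, and the usual remedy is to get more dev set data. With mismatched distributions, the gap alone cannot distinguish among the three causes, and each calls for a different response (reducing overfitting to the dev set, accepting that performance may be near its limit, or collecting data that better matches the test distribution), so it is unclear what to work on next.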

Key points:

  • The model overfit to the dev set.
  • The test set is harder than the dev set.
  • The test set is just different from the dev set.
  • The dev/test gap alone cannot tell these causes apart, so it is unclear what to fix to improve performance.

Rubric: The essay should correctly identify the three potential causes (overfitting to dev, harder test set, different test set) and explain how the inability to distinguish between them makes the path to further improvement unclear.
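
Illustration: the ambiguity can be made concrete with a small sketch. The function name `candidate_explanations`, its parameters, and the 0.05 gap threshold are hypothetical choices made for this illustration and do not come from Machine Learning Yearning. The sketch only enumerates which explanations remain plausible; it does not diagnose the actual cause.

```python
def candidate_explanations(dev_error, test_error, same_distribution, gap_threshold=0.05):
    """Return the explanations that remain plausible for a dev/test error gap.

    All names and the threshold value are illustrative assumptions.
    """
    gap = test_error - dev_error
    if gap <= gap_threshold:
        return []  # no meaningful gap to explain
    if same_distribution:
        # Same distribution: a large gap points to a single cause.
        return ["overfit to dev set"]
    # Different distributions: the gap alone cannot separate these three causes.
    return ["overfit to dev set", "test set is harder", "test set is different"]


matched = candidate_explanations(0.02, 0.15, same_distribution=True)
mismatched = candidate_explanations(0.02, 0.15, same_distribution=False)
print(matched)
print(mismatched)
```

With matched distributions the list has one entry, so the next step is clear. With mismatched distributions the same error numbers leave three entries, which is the diagnostic ambiguity the question asks about.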

Updated 2026-05-27

Tags

Machine Learning

Deep Learning

Machine Learning Strategy

Supervised Learning

Dive into Deep Learning @ D2L

Data Science

Machine Learning Yearning @ DeepLearning.AI
