1Cademy - Diagnose the evaluation issue when a team corrects dog breed labels in a development set but leaves the test set unchanged.

Learn Before

Fix Dev and Test Set Labels Together

Case Study

Case context: A team is developing a classifier to identify dog breeds. During error analysis, they find that 5% of the development set images are mislabeled, so they hire experts to correct all labels in the development set. However, to save budget, they decide not to inspect or correct the labels in the test set. They train their classifier, achieving 98% accuracy on the corrected development set, but only 91% on the test set.

Question: Diagnose why there is a discrepancy in performance and decide what action the team should take to ensure their evaluation is valid.

Sample answer: The discrepancy arises because the dev and test sets are no longer drawn from the same distribution: only the dev set labels were corrected. The team optimized the classifier against corrected labels, but it was then judged against uncorrected test set labels, so any mislabeled test image can turn a correct prediction into a counted error. To fix this, the team must apply the exact same label-correcting process to the test set labels so that both sets are once again drawn from the same distribution and share the same evaluation criterion.
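The effect described in the sample answer can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not part of the case: the test set is taken to carry the same 5% mislabel rate found in the dev set (the team never inspected it, so the true rate is unknown), mislabels are assumed to be uniformly random over 10 hypothetical breeds, and the classifier is assumed to be right 98% of the time against the true labels. The names `flip_label` and `simulate` are illustrative only.

```python
import random


def flip_label(label, num_classes, rng):
    """Return a different class chosen uniformly at random (a mislabel)."""
    wrong = rng.randrange(num_classes - 1)
    return wrong if wrong < label else wrong + 1


def simulate(n=20000, num_classes=10, model_accuracy=0.98, noise_rate=0.05, seed=0):
    """Score one simulated classifier against clean and against noisy labels.

    All parameter values are assumptions chosen for illustration.
    Returns (accuracy_vs_clean_labels, accuracy_vs_noisy_labels).
    """
    rng = random.Random(seed)
    hits_clean = 0
    hits_noisy = 0
    for _ in range(n):
        truth = rng.randrange(num_classes)
        # The classifier predicts the true breed with probability model_accuracy.
        if rng.random() < model_accuracy:
            pred = truth
        else:
            pred = flip_label(truth, num_classes, rng)
        # An uncorrected label is wrong with probability noise_rate.
        if rng.random() < noise_rate:
            noisy = flip_label(truth, num_classes, rng)
        else:
            noisy = truth
        hits_clean += pred == truth
        hits_noisy += pred == noisy
    return hits_clean / n, hits_noisy / n


clean_acc, noisy_acc = simulate()
print(f"accuracy vs corrected labels:   {clean_acc:.3f}")
print(f"accuracy vs uncorrected labels: {noisy_acc:.3f}")
```

Under these assumptions the score against uncorrected labels lands near 0.98 × 0.95 ≈ 0.93, below the score against clean labels, even though the classifier is identical in both cases. The sketch does not reproduce the 91% in the case. It only shows that uncorrected labels alone can depress a measured score.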

Key points:

Identify that the dev and test sets are drawn from different distributions due to inconsistent label correction.
Explain that the team optimized for dev set performance only to be judged on a different test set criterion.
Recommend applying the same label-correcting process to the test set labels.
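The recommendation in the last key point can be sketched as running both sets through one shared correction routine before scoring. This is a hypothetical illustration, not the team's actual pipeline: the image ids, breed names, correction dictionaries, and the helpers `apply_label_corrections` and `accuracy` are all made up for the example.

```python
def apply_label_corrections(labels, corrections):
    """Return a copy of `labels` with expert-reviewed `corrections` applied."""
    unknown_ids = set(corrections) - set(labels)
    if unknown_ids:
        raise KeyError(f"corrections refer to unknown image ids: {sorted(unknown_ids)}")
    return {**labels, **corrections}


def accuracy(predictions, labels):
    """Fraction of images whose predicted breed matches the label."""
    return sum(predictions[image_id] == labels[image_id] for image_id in labels) / len(labels)


# Hypothetical toy data: image id -> breed label.
dev_labels = {"dev_001": "beagle", "dev_002": "poodle"}
test_labels = {"test_001": "husky", "test_002": "beagle"}

# Hypothetical output of the same expert review, run on both sets.
dev_corrections = {"dev_002": "pug"}
test_corrections = {"test_001": "malamute"}

# The same routine is applied to dev and test so both share one labeling standard.
dev_fixed = apply_label_corrections(dev_labels, dev_corrections)
test_fixed = apply_label_corrections(test_labels, test_corrections)

# Hypothetical classifier output on the test set.
test_predictions = {"test_001": "malamute", "test_002": "beagle"}
print(accuracy(test_predictions, test_labels))  # scored against uncorrected labels
print(accuracy(test_predictions, test_fixed))   # scored against corrected labels
```

Returning a new dictionary instead of editing in place keeps the original labels available, so scores before and after correction can be compared.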

Rubric: The response must identify that the dev and test sets are no longer drawn from the same distribution because the label correction was only applied to one of them. It must recommend applying the same label-fixing process to the test set labels to align the evaluation criteria.

Updated 2026-06-07
