1Cademy - Diagnosing a Performance Gap After Development

Learn Before

Dev Set Overfitting from Repeated Evaluation

Case Study

Diagnosing a Performance Gap After Development

Case context: Your team has just finished a three-month development cycle for a new image classification algorithm. Throughout the process, the team continuously checked performance against the dev set, tweaking the architecture to maximize accuracy. When development concluded, the algorithm achieved 98% accuracy on the dev set but only 82% on the test set.

Question: Based on the provided scenario, what specific problem should you diagnose regarding your dataset evaluation, and what is the recommended next step to address this issue?

Sample answer: The team should diagnose that the algorithm has gradually overfit to the dev set due to repeated evaluation during the three-month cycle. This is evident from the massive performance gap (98% vs 82%) between the dev and test sets. The recommended next step is to get a fresh dev set to replace the one that the algorithm has overfit to.

Key points:

Diagnose dev-set overfitting.
Identify that the large gap between dev (98%) and test (82%) performance indicates the overfitting.
Recommend getting a fresh dev set.

Rubric: Full credit is given for diagnosing dev-set overfitting due to repeated evaluations and prescribing the acquisition of a new dev set.

0

1

Updated 2026-05-27

Contributors are:

Who are from:

References

Learn Before

Related