Case Study

Diagnosing divergent performance between Eyeball and Blackbox dev sets in a speech recognition system.

Case context: You are leading a team working on a voice-activated smart speaker. To improve accuracy, you split your dev set into an Eyeball dev set (which you inspect manually to diagnose error categories) and a Blackbox dev set (which you only use for evaluation). After three weeks of tuning features and model parameters based on your manual error analysis, you run evaluations. The error rate on the Eyeball dev set has dropped from 15% to 5%, but the error rate on the Blackbox dev set has only dropped from 15% to 14%.

Question: Based on these results, diagnose what has occurred with your development data and explain what decision or next steps you should consider regarding the dev sets.

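Sample answer: The diagnosis is that the system has been overfit to the Eyeball dev set, because three weeks of tuning were guided by manual inspection of those same examples. The signal is that the Eyeball error rate improved far more than the Blackbox error rate: a drop of 10 percentage points (15% to 5%) against a drop of only 1 point (15% to 14%). The Blackbox figure is the more trustworthy estimate of real performance, since it was never inspected. To address this, the team should recognize that the current Eyeball dev set no longer gives a representative picture of the system's errors and should consider acquiring more data for it, so that further error analysis is done on examples the system has not already been tuned to.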
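The comparison behind this diagnosis can be sketched in a few lines of Python. This is only an illustration: the function names and the `ratio_threshold` of 3.0 are assumptions made for this example, not values from the case, and a real team would pick its own threshold.

```python
from typing import Tuple


def error_drop(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative reduction as a fraction)."""
    absolute = before - after
    relative = absolute / before
    return absolute, relative


def overfit_to_eyeball(
    eyeball: Tuple[float, float],
    blackbox: Tuple[float, float],
    ratio_threshold: float = 3.0,  # hypothetical cutoff, chosen for illustration
) -> bool:
    """Flag likely overfitting when the Eyeball drop far exceeds the Blackbox drop.

    Each argument is a (before, after) pair of error rates in percent.
    """
    eye_abs, _ = error_drop(*eyeball)
    box_abs, _ = error_drop(*blackbox)
    if eye_abs <= 0:
        # No improvement on the Eyeball set, so nothing to attribute to overfitting.
        return False
    if box_abs <= 0:
        # Eyeball improved while Blackbox did not improve at all.
        return True
    return eye_abs / box_abs >= ratio_threshold


# Error rates from the case, in percent: (before tuning, after tuning).
eyeball = (15.0, 5.0)
blackbox = (15.0, 14.0)

eye_abs, eye_rel = error_drop(*eyeball)
box_abs, box_rel = error_drop(*blackbox)
print(f"Eyeball:  -{eye_abs:.1f} points ({eye_rel:.0%} relative reduction)")
print(f"Blackbox: -{box_abs:.1f} points ({box_rel:.0%} relative reduction)")
print("Likely overfit to Eyeball dev set:", overfit_to_eyeball(eyeball, blackbox))
```

With the numbers from the case, the Eyeball drop is ten times the Blackbox drop, so the check flags overfitting under any reasonable threshold.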

Key points:

  • Diagnose that the system has been overfit to the Eyeball dev set.
  • Explain that the primary signal is the Eyeball error rate improving far more than the Blackbox error rate (10 points versus 1 point).
  • Suggest acquiring more data for the Eyeball dev set as a potential remedy.

Rubric: The response must correctly diagnose that the system has been overfit to the Eyeball dev set, based on its rapid improvement compared with the nearly flat Blackbox dev set result, and propose acquiring more data for the Eyeball dev set as a resolution.

Updated 2026-05-27

Tags

Machine Learning

Deep Learning

Supervised Learning

Dive into Deep Learning @ D2L

Data Science

Machine Learning Strategy

Machine Learning Yearning @ DeepLearning.AI
