Case Study

Determining sample size for initial error categorization.

Case context: Your team has just trained a new classifier and evaluated it on the dev set, where it misclassified 400 examples. A junior engineer suggests splitting the 400 errors across the team and manually reviewing all of them over the next two days to decide what to fix first.

Question: Based on best practices for using an Eyeball dev set, what alternative approach should you recommend to the junior engineer, and why?

Sample answer: I would recommend manually reviewing a random sample of about 50 of the misclassified examples first, rather than all 400. About 50 mistakes is usually enough to give a good sense of the major error sources, and sampling at random keeps the share of each error type roughly representative of the full set. Reviewing all 400 up front would be an inefficient use of the team's time, because the major error categories will likely become clear long before every example has been analyzed.
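The sampling step in the answer can be sketched in a few lines of Python. This is only an illustration, not a prescribed workflow: the function names `sample_errors` and `tally_categories`, the integer error IDs, and the category labels are all hypothetical, and the category labels would in practice come from a person reading each sampled example.

```python
import random
from collections import Counter


def sample_errors(error_ids, k=50, seed=0):
    """Draw up to k misclassified examples uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the team reviews the same sample
    k = min(k, len(error_ids))
    return rng.sample(error_ids, k)


def tally_categories(labels):
    """Return (category, count, share) tuples, most frequent category first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]


# Hypothetical setup: the 400 misclassified dev-set examples are identified by index.
error_ids = list(range(400))
sample = sample_errors(error_ids, k=50, seed=0)

# Hypothetical labels a reviewer might assign after reading the 50 sampled errors.
reviewed = ["blurry"] * 30 + ["mislabeled"] * 15 + ["other"] * 5
for category, count, share in tally_categories(reviewed):
    print(f"{category}: {count} ({share:.0%})")
```

Sorting the tally by frequency puts the largest error source first, which is the question the review is meant to answer: what to fix first.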

Key points:

  • Recommend reviewing only a sample of ~50 errors.
  • Explain that 50 is sufficient to identify major error sources.
  • Point out the inefficiency of reviewing all 400 errors initially.
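To put a rough number on "sufficient", one can estimate how far a category's share in a 50-example sample might be from its share among all 400 errors. The sketch below is a back-of-the-envelope check that is not part of the original case study: it assumes a simple random sample and uses the normal approximation with a finite population correction. The function name `proportion_margin` is hypothetical.

```python
import math


def proportion_margin(p_hat, n, population, z=1.96):
    """Approximate margin of error for a category share estimated from a sample.

    Assumes a simple random sample drawn without replacement and the normal
    approximation; z=1.96 corresponds to roughly 95% confidence.
    """
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    # The finite population correction shrinks the error because 50 of 400 is a sizable fraction.
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc


# Worst case (a category covering half of the sample), 50 reviewed out of 400 errors.
margin = proportion_margin(0.5, n=50, population=400)
print(f"+/- {margin:.1%}")
```

Under these assumptions the worst-case margin is roughly 13 percentage points. That is coarse, but it is enough to tell a category that accounts for most of the errors apart from one that accounts for a small fraction, which is all the initial prioritization needs.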

Rubric: The answer should recommend reviewing a random sample of about 50 errors and explain that this is enough to understand the major error sources while saving significant time compared to reviewing all 400.

Updated 2026-06-07

Tags

Machine Learning

Deep Learning

Machine Learning Strategy

Supervised Learning

Dive into Deep Learning @ D2L

Data Science

Machine Learning Yearning @ DeepLearning.AI