Case Study

Deciding whether to collect more training data

Case context: A machine learning team has measured their algorithm's training error and dev set error using their entire dataset of 10,000 examples. This gives them only a single point, the rightmost point of a learning curve. They are unsure whether they should spend a month collecting 5,000 more examples.

Question: What visual tool should the team construct, and what specific curves should it include to help them make a confident decision about data collection?

Sample answer: The team should plot a full learning curve showing both the training error curve and the dev error curve as the training set grows, obtained by training on subsets of different sizes (e.g., 2,000, 4,000, 6,000, 8,000, and 10,000 examples). By examining both curves together on the same plot rather than just the final point, they can extrapolate the dev error curve's trajectory with more confidence and judge whether more data is likely to lower the dev error.
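As a rough illustration of how the points of such a learning curve could be computed, here is a minimal sketch in Python. Everything in it is an assumption made for the example rather than part of the case: the data is synthetic, the model is a simple least-squares line, the dev set is a separate fixed set of 2,000 examples, and the helper names (`fit_line`, `mse`, `learning_curve`, `make_data`) are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b (stand-in for the team's model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given examples."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(train_x, train_y, dev_x, dev_y, sizes):
    """Train on the first m examples for each m; evaluate on that subset and on the fixed dev set."""
    rows = []
    for m in sizes:
        sub_x, sub_y = train_x[:m], train_y[:m]
        model = fit_line(sub_x, sub_y)
        rows.append((m, mse(model, sub_x, sub_y), mse(model, dev_x, dev_y)))
    return rows


# Synthetic data, assumed purely for illustration: y = 2x + 0.5 plus Gaussian noise.
rng = random.Random(0)


def make_data(n):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2.0 * x + 0.5 + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


train_x, train_y = make_data(10_000)
dev_x, dev_y = make_data(2_000)  # assumed separate dev set, held fixed for every subset size
sizes = [2_000, 4_000, 6_000, 8_000, 10_000]

curve = learning_curve(train_x, train_y, dev_x, dev_y, sizes)
for m, train_err, dev_err in curve:
    print(f"{m:>6}  train={train_err:.4f}  dev={dev_err:.4f}")
```

Each row of `curve` is one point on the plot: training set size on the horizontal axis, with training error and dev error as two separate curves. The dev set stays the same for every subset size so that the dev error values are comparable across points.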

Key points:

  • Construct a full learning curve across different training set sizes.
  • Include both the training error curve and the dev error curve on the same plot.
  • Extrapolate the dev error curve visually to inform the decision.
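The extrapolation in the last key point is meant to be done by eye, but a crude numeric summary can complement the plot. The sketch below is one possible heuristic, not an established method: it reports how fast dev error fell over the last two measured sizes and how large the gap to training error still is. The function name `dev_error_trend`, the threshold `flat_tol`, and the error values are all hypothetical, made up for illustration.

```python
def dev_error_trend(curve, flat_tol=0.001):
    """Summarize the right end of a learning curve.

    curve: list of (train_size, train_err, dev_err) tuples sorted by train_size.
    flat_tol: assumed threshold; a drop in dev error smaller than this
              per 1,000 extra examples is treated as a plateau.
    """
    (m_prev, _, dev_prev), (m_last, train_last, dev_last) = curve[-2], curve[-1]
    # Change in dev error per 1,000 additional training examples (negative = improving).
    slope_per_1k = (dev_last - dev_prev) / (m_last - m_prev) * 1000
    # Remaining gap between dev error and training error at the largest size.
    gap = dev_last - train_last
    return {
        "slope_per_1k": slope_per_1k,
        "gap": gap,
        "still_falling": slope_per_1k < -flat_tol,
    }


# Made-up error rates for illustration only; not measurements from the case.
hypothetical_curve = [
    (2_000, 0.020, 0.200),
    (4_000, 0.040, 0.160),
    (6_000, 0.050, 0.140),
    (8_000, 0.060, 0.125),
    (10_000, 0.065, 0.115),
]

summary = dev_error_trend(hypothetical_curve)
print(summary)
```

With these made-up numbers, dev error is still falling and sits well above training error, which is the pattern where more data is more likely to help. A flat dev curve would suggest that the month of data collection is less likely to pay off. A two-point slope is noisy, so this is best read alongside the full plot rather than in place of it.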

Rubric: The answer must propose plotting a full learning curve across different dataset sizes, rather than relying on the single rightmost point. It must also specify plotting both training and dev error to allow for dev error extrapolation.

Updated 2026-06-13

Tags

Machine Learning

Deep Learning

Supervised Learning

Dive into Deep Learning @ D2L

Data Science

Machine Learning Strategy

Machine Learning Yearning @ DeepLearning.AI