Essay

Why Standard Data Splits Fail With Different Future Distributions

Question: Explain why simply designating 30% of your available training data as a test set is problematic when you expect future data to differ in nature from your training data. Describe how you should choose your dev and test sets under these conditions according to Machine Learning Yearning.

Sample answer: A standard 70%/30% random split implicitly assumes that the test data comes from the same distribution as the training data. If future data will differ from the training data, a test set carved out of the training pool measures performance only on the training distribution, not on the data the system will actually receive, so a good test score can overstate real-world performance. According to Machine Learning Yearning, dev and test sets should be chosen to reflect the data you expect to get in the future and want to do well on, rather than whatever data happens to be available for training.

Key points:

  • A 30% random split of the training data reproduces the training distribution, so it cannot capture a different future distribution.
  • Training and test distributions should not be assumed to be the same.
  • Dev and test sets must reflect future data that the system needs to perform well on.
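
The key points above can be illustrated with a minimal sketch. It assumes two hypothetical pools of labeled examples: a large one that is easy to collect ("web") and a smaller one that resembles the data expected in the future ("mobile"). The function name, parameters, and toy data are illustrative assumptions, not code from Machine Learning Yearning. Dev and test sets are drawn only from the future-like pool; any leftover future-like examples join the training set.

```python
import random


def split_by_target_distribution(web_examples, mobile_examples,
                                 n_dev, n_test, seed=0):
    """Build train/dev/test so that dev and test reflect the future data.

    Hypothetical helper: `web_examples` is the plentiful training-like pool,
    `mobile_examples` is the pool that looks like expected future data.
    """
    if n_dev + n_test > len(mobile_examples):
        raise ValueError("not enough future-like examples for dev and test")

    rng = random.Random(seed)          # seeded for a reproducible split
    shuffled = list(mobile_examples)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)

    # Dev and test come from the same (future-like) distribution.
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]

    # Training uses all web data plus any future-like data left over.
    train = list(web_examples) + shuffled[n_dev + n_test:]
    return train, dev, test


# Toy data: each example is tagged with the pool it came from (assumed).
web = [("web", i) for i in range(1000)]
mobile = [("mobile", i) for i in range(200)]

train, dev, test = split_by_target_distribution(web, mobile,
                                                n_dev=80, n_test=80)
```

In contrast, a plain 30% random split of `web` alone would yield a test set containing no "mobile" examples at all, so its score would say nothing about performance on the future data.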

Rubric: The answer should explain that a 30% split fails because it assumes the training and test distributions are identical, and explain that dev/test sets must reflect the future data one expects to get and wants to perform well on.

Updated 2026-05-26

Tags

Machine Learning

Deep Learning

Supervised Learning

Dive into Deep Learning @ D2L

Data Science

Machine Learning Strategy
