Concept

Avoid Randomly Shuffling Mixed-Source Data into Dev/Test Sets

When available data sources differ from the distribution one cares about, randomly shuffling all available data into train/dev/test sets can make the dev/test sets fail to reflect the target distribution. In the cat-app example, shuffling user images together with many more internet images would make about 97.6% of dev/test data internet images, so it would not reflect the app-user distribution.

0

1

Updated 2026-05-26

Contributors are:

Who are from:

Tags

Machine Learning

Deep Learning

Supervised Learning

Dive into Deep Learning @ D2L

Data Science

Machine Learning Strategy

Learn After