1Cademy - Analyze the consequences of shuffling mixed-source data into the dev and test sets.

Essay

Question: Based on the cat-app example, analyze the consequences of randomly shuffling all 210,000 available images (consisting of 205,000 internet images and 5,000 user images) into the train, dev, and test sets. Why does this approach fail to align with the core recommendation for choosing dev and test sets?

Sample answer: Randomly shuffling all 210,000 available images before splitting gives the train, dev, and test sets the same distribution. Because internet images make up the vast majority of the data (205,000 of 210,000), about 97.6% of the dev and test sets will be internet images and only about 2.4% will be user images. The dev and test sets therefore fail to reflect the app-user distribution, which is the target distribution: the data the team expects to get in the future and wants to do well on. As a result, the team will optimize the model for internet images rather than for actual user images.
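
The 97.6% figure in the sample answer can be checked with a short Python sketch. The image counts come from the question; the dev set size of 2,500, the seed, and the function names are hypothetical choices made only for illustration. The simulation draws one random dev set from the pooled data and counts how many user images land in it.

```python
import random

N_INTERNET = 205_000   # internet images (from the question)
N_USER = 5_000         # app-user images (from the question)
DEV_SIZE = 2_500       # hypothetical dev set size, assumed for illustration


def internet_share(n_internet: int, n_user: int) -> float:
    """Expected fraction of internet images in any random subset of the pool."""
    return n_internet / (n_internet + n_user)


def user_images_in_random_dev(n_internet: int, n_user: int,
                              dev_size: int, seed: int = 0) -> int:
    """Draw a dev set uniformly at random from the pooled data and
    count how many of its images are user images."""
    rng = random.Random(seed)
    total = n_internet + n_user
    # Indices below n_internet stand for internet images,
    # the remaining indices stand for user images.
    dev_indices = rng.sample(range(total), dev_size)
    return sum(1 for i in dev_indices if i >= n_internet)


share = internet_share(N_INTERNET, N_USER)
user_in_dev = user_images_in_random_dev(N_INTERNET, N_USER, DEV_SIZE)

print(f"Expected internet share of dev/test: {share:.1%}")
print(f"User images in one simulated dev set of {DEV_SIZE}: {user_in_dev}")
```

Under these assumptions, only a few dozen of the 2,500 dev images would be expected to come from app users, so dev-set metrics would mostly measure performance on internet images.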

Key points:

Shuffling all available data forces the train, dev, and test sets to come from the same distribution.
About 97.6% (205,000 out of 210,000) of the dev/test data would come from internet images.
The resulting dev/test sets fail to reflect the target app-user distribution.
It violates the recommendation to choose dev and test sets that reflect the data you expect to get in the future and want to do well on.

Rubric: The answer must explain that shuffling results in dev/test sets dominated by internet images (approx. 97.6%), which fails to reflect the target user distribution, and explain why this violates the rule to choose sets reflecting future target data.

Updated 2026-06-12