1Cademy - Evaluate the data partitioning strategy for a cat-app using mixed-source data.

Learn Before

Avoid Randomly Shuffling Mixed-Source Data into Dev/Test Sets

Case Study

Evaluate the data partitioning strategy for a cat-app using mixed-source data.

Case context: You are building a cat-app and have collected 205,000 images from the internet and 5,000 images uploaded by users. A team member suggests randomly shuffling all 210,000 images together before splitting them into train, dev, and test sets to ensure that all three sets come from the exact same distribution.

Question: Diagnose the flaw in this partitioning strategy. What will be the composition of the dev/test sets, and how does this affect your team's ability to optimize for the target distribution?

Sample answer: The flaw is that shuffling the mixed-source data produces dev and test sets in which approximately 97.6% of the images (205,000 of 210,000) come from the internet and only about 2.4% come from users. This composition does not reflect the target app-user distribution that the application actually needs to perform well on. Consequently, the team's optimization efforts will be misdirected toward performing well on internet images rather than user-uploaded images.

Key points:

Random shuffling yields dev/test sets that are about 97.6% internet images.
Internet images do not reflect the target app-user distribution.
The team will optimize for the wrong distribution (internet images instead of user images).
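
Worked check: the 97.6% figure in the key points follows directly from the image counts in the case context, because a random shuffle gives every split the same expected source mix as the full pool. The short Python sketch below reproduces that arithmetic. The function name `expected_share` and the dev-set size `DEV_SIZE` are illustrative assumptions and do not come from the case study.

```python
INTERNET_IMAGES = 205_000  # from the case context
USER_IMAGES = 5_000        # from the case context


def expected_share(internet: int, user: int) -> dict:
    """Expected fraction of each source in any subset drawn after a random shuffle."""
    total = internet + user
    return {"internet": internet / total, "user": user / total}


shares = expected_share(INTERNET_IMAGES, USER_IMAGES)

# Hypothetical dev-set size, chosen only for illustration.
DEV_SIZE = 2_500
expected_user_in_dev = DEV_SIZE * shares["user"]

print(f"internet share: {shares['internet']:.1%}")  # 97.6%
print(f"user share:     {shares['user']:.1%}")      # 2.4%
print(f"expected user images in a {DEV_SIZE}-image dev set: "
      f"{expected_user_in_dev:.0f}")                # 60
```

Under these assumptions, a 2,500-image dev set would contain only about 60 user-uploaded images, so nearly all of the measured error would come from internet images.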

Rubric: The response must identify that the dev/test sets will fail to reflect the target app-user distribution, calculate or state that approximately 97.6% of dev/test images will be internet images, and explain that optimization will focus on the wrong distribution.

Updated 2026-06-18
