1Cademy - Diagnose why a learning curve is noisy for a rare disease classifier and propose a fix.

Learn Before

Balanced Subsets for Noisy Learning Curves in Skewed or Many-Class Data

Case Study

Case context: An engineer is training a model to detect a rare disease where only 5% of the patient records in the training set are positive examples. To plot a learning curve, the engineer extracts training subsets of sizes 20, 50, 100, and 500 at random from the training set. The resulting learning curve is extremely noisy and jagged, particularly at the smaller subset sizes of 20 and 50.

Question: Diagnose the source of the noise in the learning curve at small subset sizes, and decide on a specific sampling adjustment the engineer should make for these subsets. Calculate the exact number of positive examples that should be present in the subset of size 20 under your proposed adjustment.

Sample answer: The noise at small subset sizes arises because random sampling from a dataset with a 5% positive rate gives high variance in the number of positive examples per subset. A random subset of size 20 contains 1 positive example on average, but by chance it may contain 0, 1, 2, or more, and roughly 36% of such subsets (0.95^20 ≈ 0.36) contain no positive examples at all. These swings in class composition cause large fluctuations in the trained model, and therefore in the measured error, from one subset to the next. The engineer should instead use balanced subsets, in which the class fractions match those of the original training set. For a subset of size 20, 5% of the examples should be positive, so it should contain exactly 1 positive example and 19 negative examples.
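
The variance described in the sample answer can be checked with a short calculation. The sketch below is illustrative only: it assumes each record in a random subset is positive independently with probability 0.05 (a binomial model, reasonable when the training set is much larger than the subset), and the function name `positive_count_pmf` is hypothetical.

```python
from math import comb


def positive_count_pmf(subset_size: int, positive_rate: float, k: int) -> float:
    """Probability that a random subset holds exactly k positive examples.

    Assumption: binomial model, i.e. each drawn record is positive
    independently with probability `positive_rate`.
    """
    return (
        comb(subset_size, k)
        * positive_rate**k
        * (1 - positive_rate) ** (subset_size - k)
    )


# Subset sizes from the case context; 0.05 is the 5% positive rate.
for size in (20, 50, 100, 500):
    expected = size * 0.05
    p_zero = positive_count_pmf(size, 0.05, 0)
    print(f"size={size:>3}  expected positives={expected:5.1f}  "
          f"P(no positives)={p_zero:.3f}")
```

Under this assumption, the chance of drawing a subset with no positive examples shrinks quickly as the subset grows, which is consistent with the curve being most jagged at sizes 20 and 50.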

Key points:

Random sampling of small subsets from a skewed dataset (5% positive) causes high variance in class representation.
The class ratio instability in small random subsets translates into a noisy, jagged learning curve.
The engineer should use balanced subsets where class fractions match the original training set's fractions.
A balanced subset of size 20 must contain exactly 1 positive example to maintain the 5% ratio.
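
The balanced-subset idea in the key points could be implemented roughly as follows. This is a minimal sketch, not the engineer's actual pipeline: the helper `balanced_subset`, the toy `labels` list (1,000 records, 5% positive), and the round-half-up rule for fractional counts are all assumptions made for illustration.

```python
import random


def balanced_subset(labels, subset_size, seed=0):
    """Return indices of a subset whose positive fraction matches `labels`.

    Assumptions: labels are 0/1, and a fractional positive count is
    rounded half up (e.g. 2.5 positives for size 50 becomes 3).
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]

    # Integer arithmetic avoids floating-point rounding surprises.
    total = len(labels)
    n_pos = (subset_size * len(positives) + total // 2) // total
    n_neg = subset_size - n_pos

    # Sample each class separately, without replacement, then mix.
    chosen = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(chosen)
    return chosen


# Hypothetical training labels: 50 positives out of 1,000 (5%).
labels = [1] * 50 + [0] * 950

for size in (20, 50, 100, 500):
    idx = balanced_subset(labels, size)
    n_positive = sum(labels[i] for i in idx)
    print(f"size={size:>3}  positives={n_positive}  negatives={size - n_positive}")
```

Sampling each class separately fixes the class composition of every subset, so the remaining variation in the learning curve comes from which records were drawn rather than from how many positives happened to be included.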

Rubric: The user must diagnose that random sampling causes high variance in class composition for small subsets from skewed data, creating noise. They must decide to use balanced subsets. They must correctly calculate that a subset of size 20 must have exactly 1 positive example (5% of 20) to match the original training set ratio.

Updated 2026-06-17
