Diagnose why a learning curve is noisy for a rare disease classifier and propose a fix.
Case context: An engineer is training a model to detect a rare disease where only 5% of the patient records in the training set are positive examples. To plot a learning curve, the engineer extracts training subsets of sizes 20, 50, 100, and 500 at random from the training set. The resulting learning curve is extremely noisy and jagged, particularly at the smaller subset sizes of 20 and 50.
Question: Diagnose the source of the noise in the learning curve at small subset sizes, and decide on a specific sampling adjustment the engineer should make for these subsets. Calculate the exact number of positive examples that should be present in the subset of size 20 under your proposed adjustment.
Sample answer: The noise at small subset sizes arises because random sampling from a training set with a 5% positive rate gives high variance in the number of positive examples per subset. A random subset of size 20 contains 1 positive example on average, but it can easily contain 0, 2, or more by chance, and a subset with no positives gives the model nothing from which to learn the positive class. The engineer should instead draw balanced subsets, meaning subsets whose class fractions match those of the original training set (also known as stratified sampling). For a subset of size 20, 5% of the examples should be positive: 0.05 × 20 = 1, so the subset should contain exactly 1 positive example and 19 negative examples.
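As a rough worked check of how unstable the positive count is at size 20, the figures below treat each drawn record as independently positive with probability 0.05. This binomial model is an assumption made here for illustration. It is a close approximation when the training set is much larger than the subset, but the case does not state the training set size.

```latex
\[
P(K = k) = \binom{20}{k}\,(0.05)^{k}\,(0.95)^{20-k}
\]
\[
P(K = 0) = 0.95^{20} \approx 0.358, \qquad
P(K = 1) = 20 \cdot 0.05 \cdot 0.95^{19} \approx 0.377, \qquad
P(K = 2) = 190 \cdot 0.05^{2} \cdot 0.95^{18} \approx 0.189
\]
\[
P(K \ge 3) = 1 - P(K \le 2) \approx 0.075
\]
```

Under this assumption, roughly one random subset of size 20 in three contains no positive examples at all, while only about 38% contain the 1 positive that the 5% ratio calls for.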
Key points:
- Random sampling of small subsets from a skewed dataset (5% positive) causes high variance in class representation.
- The class ratio instability in small random subsets translates into a noisy, jagged learning curve.
- The engineer should use balanced subsets where class fractions match the original training set's fractions.
- A balanced subset of size 20 must contain exactly 1 positive example to maintain the 5% ratio.
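The key points above can be illustrated with a minimal sketch that compares random and balanced subsets of size 20. The 1,000-record training set, the helper names (`balanced_subset`, `random_subset`, `count_positives`), and the rounding rule for the positive quota are all assumptions for illustration and are not part of the original case.

```python
import random
from collections import Counter

def balanced_subset(labels, size, rng):
    """Return indices of a subset whose positive fraction matches `labels`."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Positives needed to keep the original ratio, rounded to a whole example.
    n_pos = round(size * len(pos) / len(labels))
    idx = rng.sample(pos, n_pos) + rng.sample(neg, size - n_pos)
    rng.shuffle(idx)
    return idx

def random_subset(labels, size, rng):
    """Return indices drawn uniformly at random, ignoring class labels."""
    return rng.sample(range(len(labels)), size)

def count_positives(labels, idx):
    """Count how many of the selected records are positive."""
    return sum(labels[i] for i in idx)

rng = random.Random(0)          # fixed seed so the run is repeatable
labels = [1] * 50 + [0] * 950   # hypothetical training set: 5% positive

trials = 2000
random_counts = Counter(
    count_positives(labels, random_subset(labels, 20, rng)) for _ in range(trials)
)
balanced_counts = Counter(
    count_positives(labels, balanced_subset(labels, 20, rng)) for _ in range(trials)
)

# Histogram of positives per subset: spread out for random, fixed for balanced.
print("random:  ", sorted(random_counts.items()))
print("balanced:", sorted(balanced_counts.items()))
```

The same quota rule gives 5 positives at size 100 and 25 at size 500. At size 50 the exact quota is 2.5, so the ratio can only be matched approximately (2 or 3 positives), and the rounding choice made here is one option among several.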
Rubric: The user must diagnose that random sampling causes high variance in class composition for small subsets from skewed data, creating noise. They must decide to use balanced subsets. They must correctly calculate that a subset of size 20 must have exactly 1 positive example (5% of 20) to match the original training set ratio.
Tags
Machine Learning
Deep Learning
Supervised Learning
Dive into Deep Learning @ D2L
Data Science
Machine Learning Strategy
Related
Why use a balanced subset instead of a random subset when drawing small training sets for learning curves on skewed data?
True or False: On skewed or many-class data, balanced subsets—where each class fraction mirrors the original dataset—produce less noisy learning curves than purely random subsets.
On skewed or multi-class training data, you should choose a _____ subset so that each class fraction matches the original training set as closely as possible.
In which situation does Andrew Ng recommend using a balanced subset when constructing training sets for learning curves?
A balanced subset for learning curves ensures each class appears in proportion to its share of the full training set.
To reduce noise in learning curves on skewed or many-class data, Andrew Ng recommends sampling a _____ subset instead of a purely random one.
Match each term related to balanced subset sampling with its correct description.
Order the steps for constructing a single balanced training subset to plot one point on a learning curve.
What is the primary benefit of using balanced subsets when plotting learning curves on skewed or many-class data?
Random sampling of small training subsets always produces smooth learning curves regardless of class distribution.
If 20% of the original training set is positive examples and you draw a balanced subset of 10, you should include _____ positive examples.
Match each data condition to the sampling problem it causes when constructing small learning curve subsets.
Order the reasoning steps for deciding whether and how to apply balanced subset sampling for a learning curve.
Explain how class imbalance affects learning curves and how balanced subsets resolve this.
State the rule for determining class distribution in balanced learning curve subsets.