1Cademy - Diagnosing Performance with Target-Distribution Data

Case Study

Case context: You are building a speech recognition system for a voice-controlled mobile navigation app. You have 500,000 general audio clips (auxiliary data) and 20,000 clips of users speaking street addresses (target distribution). You reserve 10,000 street address clips for the dev and test sets and combine the remaining 10,000 with the 500,000 general clips for training. During training, you monitor performance on two sets of street address clips: those inside the training set, and a separate training dev set of street addresses that the model is not trained on.
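The split described in the case context can be checked with a few lines of arithmetic. This is only a sketch using the counts stated above; the variable names are illustrative and not part of the case study. It shows how small the target-distribution share of the training set is.

```python
# Counts taken directly from the case context.
GENERAL_CLIPS = 500_000         # auxiliary data
ADDRESS_CLIPS = 20_000          # target distribution (street addresses)
ADDRESS_FOR_DEV_TEST = 10_000   # reserved for the dev and test sets

# Street address clips left over for training.
address_for_training = ADDRESS_CLIPS - ADDRESS_FOR_DEV_TEST

# Training set = all general clips + the remaining street address clips.
training_total = GENERAL_CLIPS + address_for_training

# Share of the training set that comes from the target distribution.
target_share = address_for_training / training_total

print(address_for_training)   # 10000
print(training_total)         # 510000
print(f"{target_share:.2%}")  # 1.96%
```

Street address clips make up roughly 2% of the training set, which is why performance on that subset is worth monitoring separately.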

Question: If your system achieves high accuracy on the 10,000 street address examples in the training set, but poor accuracy on the street address examples in your training dev set, what should you diagnose as the primary issue, and what actionable decision should you make based on this validation?

Sample answer: High accuracy on the street address training data combined with poor accuracy on the street address training dev data indicates that the model is overfitting to the limited target-distribution data it was trained on. Because both sets contain street addresses, the gap cannot be attributed to a mismatch between distributions; the model has fit the specific street address clips it saw rather than learning to generalize. This supports the hypothesis that the model needs more target-distribution data to generalize well. The actionable decision is to focus on acquiring more street address audio clips to expand the target-distribution training data.
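The comparison in the sample answer can be sketched as a small diagnostic check. Everything here is an assumption made for illustration: the function names, the exact-match accuracy metric (real speech systems more often report word error rate), the `max_gap` threshold of 0.05, and the accuracy values in the usage lines are hypothetical and do not come from the case study.

```python
def accuracy(predictions, references):
    """Fraction of clips whose predicted transcript exactly matches the reference.

    Exact match is a simplification chosen for this sketch.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def diagnose(train_address_acc, train_dev_address_acc, max_gap=0.05):
    """Compare accuracy on street address training clips with accuracy on
    the street address training dev set.

    max_gap is a hypothetical tolerance, not a value from the case study.
    """
    gap = train_address_acc - train_dev_address_acc
    if gap > max_gap:
        return ("overfitting to target-distribution training data: "
                "acquire more street address clips")
    return "no large gap between training and training dev street address accuracy"


# Tiny worked example of the metric: 2 of 3 transcripts match.
example_acc = accuracy(
    ["12 main street", "40 oak avenue", "7 elm road"],
    ["12 main street", "40 oak avenue", "7 elm rode"],
)

# Hypothetical accuracies illustrating the scenario in the question.
result = diagnose(train_address_acc=0.97, train_dev_address_acc=0.78)
print(result)
```

With a high training accuracy and a much lower training dev accuracy on the same kind of data, the check returns the overfitting diagnosis, matching the reasoning in the sample answer.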

Key points:

Diagnose overfitting to the available target-distribution training examples.
Validate the hypothesis that more target-distribution data is necessary.
Decide to acquire more data matching the dev/test distribution (street addresses).

Rubric: The response must diagnose the issue as overfitting to the available target-distribution data and decide that the team should acquire more data from that specific distribution.

Updated 2026-06-13
