
Arbitrary Distribution Shift

When data distributions shift between training and testing in arbitrary, unconstrained ways, learning a robust classifier is fundamentally impossible. For instance, in a binary classification task such as distinguishing cats from dogs, suppose the input distribution $p_S(\mathbf{x}) = p_T(\mathbf{x})$ remains exactly the same but every label is deterministically flipped, so that $p_S(y \mid \mathbf{x}) = 1 - p_T(y \mid \mathbf{x})$. An algorithm that only sees inputs cannot distinguish this pathological scenario from one in which the distribution never changed at all.
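The construction above can be simulated directly. The sketch below is a minimal, hypothetical example (not from the source): it assumes a one-dimensional input $x \sim \mathcal{N}(0, 1)$ with deterministic source labels $y = \mathbf{1}[x > 0]$, and a target domain with the identical input distribution but every label flipped. The classifier that is perfect on the source is then wrong on every target example, yet no test on the inputs alone could detect the shift.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inputs drawn once: the marginal p(x) is identical in source and target.
n = 10_000
x = rng.normal(0.0, 1.0, size=n)

# Source labels are a deterministic function of x; target labels are
# the same function with every label flipped: p_T(y|x) = 1 - p_S(y|x).
y_src = (x > 0).astype(int)
y_tgt = 1 - y_src

# The Bayes-optimal source classifier: predict 1 iff x > 0.
pred = (x > 0).astype(int)

acc_src = (pred == y_src).mean()  # perfect on the source domain
acc_tgt = (pred == y_tgt).mean()  # wrong on every target example

print(f"source accuracy: {acc_src:.2f}, target accuracy: {acc_tgt:.2f}")
```

Since the two domains share the same samples of $\mathbf{x}$, nothing observable before labels arrive distinguishes the benign case from the adversarial one, which is why robustness guarantees require restricting how the distribution may shift (e.g., covariate or label shift assumptions).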


Updated 2026-05-03


Tags

D2L

Dive into Deep Learning @ D2L