1Cademy - Debate over Internal Covariate Shift as Explanation for Batch Normalization

The original batch normalization paper (Ioffe and Szegedy, 2015) attributed the technique's effectiveness to a reduction in internal covariate shift: the change in the distribution of each layer's inputs during training as the parameters of preceding layers are updated. This explanation has been challenged on two grounds. First, the name is a misnomer. Covariate shift refers to a change in the marginal distribution of inputs alone, whereas the drift described here is closer to concept drift, in which the relationship between inputs and outputs itself changes. Second, the explanation offers only an under-specified intuition, not a rigorous account of why batch normalization works. Subsequent research (Santurkar et al., 2018) has proposed that batch normalization instead helps by smoothing the optimization landscape, and reports that the technique can succeed even when it exhibits behavior opposite to what the internal covariate shift hypothesis predicts.

The debate gained prominence when Ali Rahimi used internal covariate shift as a focal example in his Test of Time Award speech at the 2017 NeurIPS conference, in which he likened modern deep learning practice to alchemy. The example was later revisited in a position paper by Lipton and Steinhardt (2018) on troubling trends in machine learning scholarship. The textbook urges practitioners to separate such guiding intuitions from established scientific fact, particularly when writing research papers, and notes that even the generalization ability of simpler deep neural networks (MLPs and conventional CNNs) is not yet well understood from a learning-theoretic perspective.
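To make the term concrete, the following is a minimal sketch of what "internal covariate shift" refers to, using only the Python standard library. Everything in it is an illustrative assumption and is not taken from any of the cited papers: a single hypothetical upstream unit `ReLU(w*x + b)`, two arbitrary parameter settings standing in for "before" and "after" a training update, and a simplified `batch_norm` helper with no learned scale or shift. It shows only that the statistics of a downstream layer's inputs move when upstream parameters change, and that standardizing each minibatch pins the mean and variance. It is not evidence for or against the hypothesis that this is why batch normalization helps.

```python
import random
import statistics


def layer_inputs(xs, w, b):
    """Outputs of a hypothetical upstream unit, ReLU(w*x + b).

    These values are what the next layer receives as input.
    """
    return [max(0.0, w * x + b) for x in xs]


def batch_norm(values, eps=1e-5):
    """Standardize one minibatch to (approximately) zero mean and unit variance.

    Simplified on purpose: no learned scale/shift and no running statistics.
    """
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def summarize(values):
    """Return (mean, population standard deviation) of a list of floats."""
    return statistics.fmean(values), statistics.pstdev(values)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
batch = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Same data, two assumed parameter settings for the preceding layer.
before = layer_inputs(batch, w=1.0, b=0.0)
after = layer_inputs(batch, w=3.0, b=1.0)

# How far the mean of the next layer's inputs moves, without and with
# per-batch standardization.
shift_raw = abs(summarize(after)[0] - summarize(before)[0])
shift_bn = abs(summarize(batch_norm(after))[0] - summarize(batch_norm(before))[0])

print("mean/std before update:", summarize(before))
print("mean/std after update: ", summarize(after))
print("shift in mean without normalization:", shift_raw)
print("shift in mean with normalization:   ", shift_bn)
```

Note that the sketch only tracks the first two moments of one unit's output. Santurkar et al. (2018) argue that stabilizing these statistics is not what explains batch normalization's benefit, so the example should be read as an illustration of the terminology and not of the mechanism.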