When applying batch normalization, the choice of minibatch size is highly significant. For fully connected layers, if batch normalization is applied with a minibatch of size $$1$$, the network cannot learn because subtracting the mean causes each hidden unit to take a value of $$0$$. Therefore, a suitably large minibatch is required for stable training. However, in the context of convolutional layers, batch normalization remains well-defined even for minibatches of size $$1$$, because the mean and variance are computed simultaneously across all spatial locations within the single image observation.

Claude

Batch normalization is a technique designed to accelerate and stabilize the training of deep neural networks. Mechanistically, it centers and rescales the intermediate layer activations back to a controlled mean and variance, preventing their distributions from diverging across layers and over time. By keeping these intermediate values on a comparable scale, batch normalization enables the use of more aggressive learning rates. The technique was originally motivated by the concept of covariate shift applied to internal layers, but the hypothesis that it works by reducing this so-called internal covariate shift has since been challenged and does not appear to be a valid explanation for its effectiveness. Although intuitively thought to make the optimization landscape smoother, the precise mechanism by which batch normalization aids training remains an open research question. Despite this theoretical uncertainty, batch normalization has proven indispensable in practice, being applied in nearly all deployed image classifiers and earning the original paper tens of thousands of citations.

Batch Normalization

Dive into Deep Learning

Here is a useful website about Batch Norm: https://www.cerebras.net/neurips-2019-online-normalization-for-training-neural-networks/

NeurIPS 2019: Online Normalization for Training Neural Networks

Batch Normalization and Batch Size

Batch normalization conveys three primary benefits during the training of deep networks: preprocessing, numerical stability, and regularization. First, similar to feature standardization, it puts parameters on a similar scale which is favorable for optimizers. Second, it provides numerical stability by preventing intermediate activations from taking widely varying magnitudes across layers and over time. Finally, the use of noisy estimates for the mean and variance injects noise into the optimization process, which acts as a serendipitous form of regularization that reduces overfitting.

Benefits of Batch Normalization

Batch normalization is applied to individual layers by standardizing the inputs based on the statistics of the current minibatch $$\mathcal{B}$$. For an input $$\mathbf{x} \in \mathcal{B}$$, the batch normalization $$\textrm{BN}$$ is defined as:

$$ \textrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}}}{\hat{\boldsymbol{\sigma}}_{\mathcal{B}}} + \boldsymbol{\beta} $$

Here, $$\hat{\boldsymbol{\mu}}_{\mathcal{B}} = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x}$$ is the sample mean, and $$\hat{\boldsymbol{\sigma}}_{\mathcal{B}}^2 = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}})^2 + \epsilon$$ is the sample variance with a small constant $$\epsilon > 0$$ added for numerical stability to prevent division by zero. The parameters $$\boldsymbol{\gamma}$$ (scale parameter) and $$\boldsymbol{\beta}$$ (shift parameter) are learned during training to recover the degrees of freedom lost due to standardization.

Batch Normalization Formula

Layer normalization is a technique that standardizes the activations of a deep network by applying the normalization to one observation at a time, rather than across a minibatch. For an $$ n $$-dimensional input vector $$ \mathbf{x} $$, the layer normalization operation is defined as:
$$ \textrm{LN}(\mathbf{x}) = \frac{\mathbf{x} - \hat{\mu}}{\hat{\sigma}} $$
where the scalar mean $$ \hat{\mu} $$ and the scalar variance $$ \hat{\sigma}^2 $$ are computed across the features of the single observation:
$$ \hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i \quad \textrm{and} \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2 + \epsilon $$
A small constant $$ \epsilon > 0 $$ is added to prevent division by zero. Because it operates on a single observation, both the offset and the scaling factor in layer normalization are scalars.

Layer Normalization

Although batch normalization is widely adopted for its regularization and convergence benefits, research by Wang et al. (2022) has shown that removing batch normalization from a network can improve adversarial robustness. Models without batch normalization layers tend to be less sensitive to small adversarial input perturbations, suggesting that while batch normalization enhances standard training performance, it may introduce vulnerabilities that adversaries can exploit. Practitioners who prioritize building robust models resistant to adversarial attacks should therefore consider architectures that omit batch normalization entirely.

Learn Before

Related