Here is a useful website about Batch Norm: https://www.cerebras.net/neurips-2019-online-normalization-for-training-neural-networks/

NeurIPS 2019: Online Normalization for Training Neural Networks

When applying batch normalization, the choice of minibatch size is highly significant. For fully connected layers, if batch normalization is applied with a minibatch of size $$1$$, the network cannot learn because subtracting the mean causes each hidden unit to take a value of $$0$$. Therefore, a suitably large minibatch is required for stable training. However, in the context of convolutional layers, batch normalization remains well-defined even for minibatches of size $$1$$, because the mean and variance are computed simultaneously across all spatial locations within the single image observation.

Batch Normalization and Batch Size

Batch normalization conveys three primary benefits during the training of deep networks: preprocessing, numerical stability, and regularization. First, similar to feature standardization, it puts parameters on a similar scale which is favorable for optimizers. Second, it provides numerical stability by preventing intermediate activations from taking widely varying magnitudes across layers and over time. Finally, the use of noisy estimates for the mean and variance injects noise into the optimization process, which acts as a serendipitous form of regularization that reduces overfitting.

Benefits of Batch Normalization

Batch normalization is applied to individual layers by standardizing the inputs based on the statistics of the current minibatch $$\mathcal{B}$$. For an input $$\mathbf{x} \in \mathcal{B}$$, the batch normalization $$\textrm{BN}$$ is defined as:

$$ \textrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}}}{\hat{\boldsymbol{\sigma}}_{\mathcal{B}}} + \boldsymbol{\beta} $$

Here, $$\hat{\boldsymbol{\mu}}_{\mathcal{B}} = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x}$$ is the sample mean, and $$\hat{\boldsymbol{\sigma}}_{\mathcal{B}}^2 = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}})^2 + \epsilon$$ is the sample variance with a small constant $$\epsilon > 0$$ added for numerical stability to prevent division by zero. The parameters $$\boldsymbol{\gamma}$$ (scale parameter) and $$\boldsymbol{\beta}$$ (shift parameter) are learned during training to recover the degrees of freedom lost due to standardization.
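The formula above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `batch_norm`, the default `eps=1e-5`, and the toy minibatch are not from the source, and a real framework would also track running statistics for use at prediction time.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Apply BN(x) = gamma * (x - mean) / std + beta, feature by feature.

    `batch` is a list of equal-length feature vectors, one per example.
    The name and the default eps are illustrative assumptions.
    """
    n = len(batch)
    dim = len(batch[0])
    # Per-feature sample mean over the minibatch.
    mean = [sum(x[j] for x in batch) / n for j in range(dim)]
    # Per-feature sample variance, with eps added as in the formula.
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / n + eps for j in range(dim)]
    std = [math.sqrt(v) for v in var]
    return [
        [gamma[j] * (x[j] - mean[j]) / std[j] + beta[j] for j in range(dim)]
        for x in batch
    ]


# Toy minibatch: three examples with two features on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
out = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With $$\boldsymbol{\gamma} = \mathbf{1}$$ and $$\boldsymbol{\beta} = \mathbf{0}$$, each output feature has mean close to zero and variance close to one. Calling the same function on a minibatch containing a single example returns all zeros, because every value equals its own mean.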

Batch Normalization Formula

Layer normalization is a technique that standardizes the activations of a deep network by applying the normalization to one observation at a time, rather than across a minibatch. For an $$ n $$-dimensional input vector $$ \mathbf{x} $$, the layer normalization operation is defined as:
$$ \textrm{LN}(\mathbf{x}) = \frac{\mathbf{x} - \hat{\mu}}{\hat{\sigma}} $$
where the scalar mean $$ \hat{\mu} $$ and the scalar variance $$ \hat{\sigma}^2 $$ are computed across the features of the single observation:
$$ \hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i \quad \textrm{and} \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2 + \epsilon $$
A small constant $$ \epsilon > 0 $$ is added to prevent division by zero. Because it operates on a single observation, both the offset and the scaling factor in layer normalization are scalars.
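As a rough sketch of the definition above, layer normalization can be written for one observation at a time. The function name `layer_norm` and the default `eps=1e-5` are assumptions for illustration; following the formula in the text, no learned scale or shift is applied.

```python
import math


def layer_norm(x, eps=1e-5):
    """Standardize a single observation across its n features.

    Name and default eps are illustrative assumptions.
    """
    n = len(x)
    mu = sum(x) / n                                  # scalar mean
    var = sum((xi - mu) ** 2 for xi in x) / n + eps  # scalar variance plus eps
    sigma = math.sqrt(var)
    return [(xi - mu) / sigma for xi in x]


y = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Because the statistics come from the observation itself, the result does not depend on which other examples happen to share its minibatch.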

Layer Normalization

Although batch normalization is widely adopted for its regularization and convergence benefits, research by Wang et al. (2022) has shown that removing batch normalization from a network can improve adversarial robustness. Models without batch normalization layers tend to be less sensitive to small adversarial input perturbations, suggesting that while batch normalization enhances standard training performance, it may introduce vulnerabilities that adversaries can exploit. Practitioners who prioritize building robust models resistant to adversarial attacks should therefore consider architectures that omit batch normalization entirely.

Removing Batch Normalization for Adversarial Robustness

Batch normalization is a technique designed to accelerate and stabilize the training of deep neural networks. Mechanistically, it centers and rescales the intermediate layer activations back to a controlled mean and variance, preventing their distributions from diverging across layers and over time. By keeping these intermediate values on a comparable scale, batch normalization enables the use of more aggressive learning rates. The technique was originally motivated by the concept of covariate shift applied to internal layers, but the hypothesis that it works by reducing this so-called internal covariate shift has since been challenged and does not appear to be a valid explanation for its effectiveness. Although intuitively thought to make the optimization landscape smoother, the precise mechanism by which batch normalization aids training remains an open research question. Despite this theoretical uncertainty, batch normalization has proven indispensable in practice, being applied in nearly all deployed image classifiers and earning the original paper tens of thousands of citations.

University of Michigan - Ann Arbor

Claude

In the chain structure of a feedforward network, assume the input example is $x \in \mathbb{R}^d$. The first layer is then given by
$h^{(1)} = g^{(1)} (W^{(1)T}x + b^{(1)})$
the second layer is given by 
$h^{(2)} = g^{(2)} (W^{(2)T} h^{(1)}+ b^{(2)})$
and so on.
For the output layer,
$\hat{y} = g^{(n)}(W^{(n)T} h^{(n-1)}+ b^{(n)})$

where the matrix $W^{(k)}\in \mathbb{R}^{h_{k-1}\times h_k}$ (with $h_0 = d$) is the weight parameter for layer $k$, $b^{(k)}$ is its bias, and $g^{(k)}$ is the activation function for layer $k$. Whether $W^{(k)}$ is transposed depends on the shape convention chosen for the weight matrix and the input vector.
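The chain of layers above might be sketched as a short forward pass. Everything concrete here is an assumption for illustration: the helper names, the choice of ReLU and identity activations, and the hand-picked weights of a hypothetical 2-3-1 network. Each weight matrix is stored with shape (inputs, outputs), so the code computes $W^{(k)T} h^{(k-1)} + b^{(k)}$ as written in the text.

```python
def affine(W, x, b):
    """Compute W^T x + b, where W has shape (len(x), len(b))."""
    return [
        sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
        for j in range(len(b))
    ]


def relu(v):
    return [max(0.0, t) for t in v]


def identity(v):
    return list(v)


def forward(x, layers):
    """Apply h = g(W^T h_prev + b) for each (W, b, g) in turn."""
    h = x
    for W, b, g in layers:
        h = g(affine(W, h, b))
    return h


# Hypothetical 2-3-1 network with hand-picked weights.
layers = [
    ([[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]], [0.0, 0.0, 1.0], relu),
    ([[1.0], [1.0], [1.0]], [0.5], identity),
]
y_hat = forward([1.0, 2.0], layers)  # hidden layer is [1.0, 3.0, 0.5]
```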

List of Common Hyperparameters in Deep Learning

Normalization of training data may help reduce bias and variance, and it speeds up gradient descent and, consequently, training of the deep learning model, as shown in the figure.

Normalization Helps Deep Learning

Covariate shift is a category of distribution shift where the marginal distribution of the input features (covariates) changes over time, while the conditional distribution of the labels given the inputs, $$P(y \mid \mathbf{x})$$, remains constant. It is the most natural assumption to make in settings where the input features $$\mathbf{x}$$ are believed to cause the label $$y$$. An example of covariate shift is training a classifier to distinguish cats and dogs using real photographs, but testing it exclusively on cartoon images.

Covariate Shift

https://www.coursera.org/learn/deep-neural-network?specialization=deep-learning

Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Dive into Deep Learning

In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each neuron has a weight, and multiplying the input number with the weight gives the output of the neuron, which is transferred to the next layer.
The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.
Activation functions are mathematical functions that determine the output of each neuron, and therefore of the neural network as a whole. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. Some activation functions also help normalize the output of each neuron to a range between 0 and 1 or between -1 and 1. When used in a multilayer neural network, activation functions can be different for different layers.
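To make the description concrete, here is a small sketch of three activation functions: a step function that switches the neuron on or off at a threshold, the sigmoid, which maps to the range between 0 and 1, and tanh, which maps to the range between -1 and 1. The specific choice of functions and the default threshold of 0 are assumptions for illustration.

```python
import math


def step(z, threshold=0.0):
    """Binary gate: 1.0 if the input reaches the threshold, else 0.0."""
    return 1.0 if z >= threshold else 0.0


def sigmoid(z):
    """Squash any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squash any real input into the open interval (-1, 1)."""
    return math.tanh(z)


for z in (-2.0, 0.0, 2.0):
    print(z, step(z), round(sigmoid(z), 3), round(tanh(z), 3))
```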

Activation Functions in Neural Networks

Matrix degeneration is a phenomenon where the rank of a matrix is reduced after it undergoes some form of processing.

Matrix Degeneration

Batch Normalization

For a feedforward neural network, the depth of the network is the number of hidden layers plus one (as the output layer is also parameterized). The width of the network is the dimensionality of its hidden layers.

The main architectural considerations for designing a neural network are choosing the depth of the network and the width of each layer.


Depth and Width for Neural Networks

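Dropout randomly deactivates (“kills”) a fraction of the neurons during training to prevent overfitting. The dropout rate is the hyperparameter that sets this fraction; a new random subset of neurons is typically dropped on every training iteration rather than once per epoch.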
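A minimal sketch of dropout, assuming the common "inverted" variant in which a fresh random mask is drawn on every forward pass and the surviving activations are rescaled so that their expected value is unchanged. The function name, the `training` flag, and the seeded generator are assumptions for illustration; `rate` must be below 1.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each unit with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled at prediction time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded only to make the sketch reproducible
h = [1.0] * 10000
dropped = dropout(h, rate=0.2, rng=rng)
```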

Dropout

The learning rate of a neural network controls the size of each gradient descent step, which is taken using the gradients computed by backpropagation. A larger learning rate can make the network train faster, but it may overshoot the minimum of the loss function.
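The effect of the learning rate can be illustrated on a toy problem. The sketch below assumes the one-dimensional loss $f(w) = w^2$, whose gradient is $2w$; the function name, starting point, and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(lr, steps=20, w0=5.0):
    """Minimize f(w) = w**2 (gradient 2*w), starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w  # one gradient descent step
    return w


small = gradient_descent(lr=0.01)     # moves slowly toward the minimum at 0
good = gradient_descent(lr=0.3)       # converges quickly
too_large = gradient_descent(lr=1.1)  # overshoots more on every step and diverges
```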

Neural Network Learning Rate

When you train on a huge amount of data with limited computer memory, you can split the whole training set into several batches that fit into memory. You then feed these batches to your model one by one. After feeding every batch once, you have completed one epoch. Training a model successfully usually takes multiple epochs.
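The relationship between batches and epochs can be sketched as two nested loops. The helper name `iterate_minibatches` and the toy dataset are assumptions for illustration, and the parameter update is left as a placeholder; in practice the data is usually also reshuffled at the start of each epoch, which is omitted here.

```python
def iterate_minibatches(data, batch_size):
    """Yield consecutive slices of `data`, each small enough to fit in memory."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


data = list(range(10))  # stand-in for a training set of 10 examples
num_epochs = 3
updates = 0
for epoch in range(num_epochs):
    # One pass over every batch is one epoch.
    for batch in iterate_minibatches(data, batch_size=4):
        updates += 1  # placeholder for one parameter update on `batch`
```

With 10 examples and a batch size of 4, each epoch consists of three batches, the last one smaller than the rest.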

Epochs in Machine Learning

When a neural network trains, it uses an algorithm called an optimizer to determine the optimal weights for the model.
There are several optimizer algorithms, such as:
- Gradient descent
- Mini-batch gradient descent
- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Nesterov momentum
- AdaGrad
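For a few of the optimizers listed above, the update rules can be sketched for a single scalar weight. This is a simplified illustration, not any library's implementation: the function names and default hyperparameters are assumptions, the momentum rule shown is one of several conventions in use, and the toy loss $f(w) = w^2$ is chosen only so the gradient is easy to write down.

```python
import math


def sgd_step(w, grad, lr=0.1):
    """Plain gradient descent."""
    return w - lr * grad


def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """Gradient descent with momentum: v accumulates past gradients."""
    v = beta * v + grad
    return w - lr * v, v


def adam_step(w, m, s, t, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected moving averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(s_hat) + eps), m, s


def grad(w):
    return 2.0 * w  # gradient of the toy loss f(w) = w**2


w_sgd = 1.0
w_mom, v = 1.0, 0.0
w_adam, m, s = 1.0, 0.0, 0.0
for t in range(1, 201):  # Adam's bias correction needs t to start at 1
    w_sgd = sgd_step(w_sgd, grad(w_sgd))
    w_mom, v = momentum_step(w_mom, v, grad(w_mom))
    w_adam, m, s = adam_step(w_adam, m, s, t, grad(w_adam))
```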

Deep Learning Optimizer Algorithms

Initial weights are assigned to all the neurons.
It is necessary to set initial weights for the first forward pass. Two basic options are to set the weights to zero or to randomize them. However, a poor choice can result in vanishing or exploding gradients, which make the model difficult to train. To mitigate this problem, you can use a heuristic (a formula tied to the number of units feeding into and out of each layer) to determine the scale of the weights. A common heuristic used with the Tanh activation is called Xavier initialization.
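As an illustration of such a heuristic, the sketch below assumes the uniform variant of Xavier (Glorot) initialization, which draws each weight from $U(-a, a)$ with $a = \sqrt{6 / (n_{\text{in}} + n_{\text{out}})}$, where $n_{\text{in}}$ and $n_{\text{out}}$ are the numbers of input and output units of the layer. The function name, layer sizes, and seed are assumptions for illustration.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-a, a)."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)  # seeded only to make the sketch reproducible
W = xavier_uniform(fan_in=256, fan_out=128, rng=rng)
```

Under this choice the weights have variance $2 / (n_{\text{in}} + n_{\text{out}})$, so wider layers receive proportionally smaller initial weights.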

On top of that, if the architecture being used has already been trained by someone else (for example, on ImageNet), the weights can be initialized to those pretrained values. This is known as transfer learning, and it can substantially increase training speed and model accuracy.

Deep Learning Weight Initialization

- Grid search
- Random search
- Coarse to fine
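The first two methods in the list can be compared on a toy problem. Everything specific here is hypothetical: `validation_loss` is a stand-in for training a model and measuring its validation loss, and the two hyperparameters (`lr` and `dropout`), their ranges, and the number of trials are chosen only for illustration.

```python
import itertools
import random


def validation_loss(lr, dropout):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (lr - 0.01) ** 2 * 1e4 + (dropout - 0.3) ** 2


def grid_search(lrs, dropouts):
    """Try every combination of the candidate values."""
    return min(itertools.product(lrs, dropouts), key=lambda p: validation_loss(*p))


def random_search(n_trials, rng):
    """Sample each hyperparameter independently at random."""
    trials = [
        (10 ** rng.uniform(-4, -1), rng.uniform(0.0, 0.5))  # log-uniform learning rate
        for _ in range(n_trials)
    ]
    return min(trials, key=lambda p: validation_loss(*p))


best_grid = grid_search([0.0001, 0.001, 0.01, 0.1], [0.1, 0.3, 0.5])
best_random = random_search(50, random.Random(0))
```

A coarse-to-fine search would repeat either procedure over a narrower range centered on the best point found so far.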

Hyperparameters Tuning Methods in Deep Learning

A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data. It is required by the model when making predictions, and it is usually learned from data rather than set manually by the practitioner. Examples include the coefficients in a linear regression or logistic regression and the weights in an artificial neural network.

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data. It is often used in processes that help estimate model parameters, and it is usually specified by the practitioner. We may use rules of thumb, copy values used on other problems, or search for the best value by trial and error. Examples include the learning rate for training a neural network and the K in K-nearest neighbors.

Difference between Model Parameter and Model Hyperparameter

The trade-off between the standard prediction loss and the additive weight decay penalty is characterized by the regularization constant, $$\lambda$$. This nonnegative hyperparameter is fit using validation data and modifies the objective to $$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2$$. When $$\lambda = 0$$, the original loss function is recovered. For $$\lambda > 0$$, the size of the weights is restricted, with larger values of $$\lambda$$ constraining the weights more considerably. The penalty term is divided by $$2$$ by convention so that the constant cancels out gracefully when the derivative of the quadratic function is taken.
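The role of $$\lambda$$ can be sketched for a linear model on a single example. The squared-error loss, the function name, and the numbers are assumptions for illustration; the point is that the penalty $$\frac{\lambda}{2} \|\mathbf{w}\|^2$$ contributes exactly $$\lambda \mathbf{w}$$ to the gradient, with the factor of $$2$$ cancelled.

```python
def penalized_loss_and_grad(w, b, x, y, lam):
    """Squared-error loss on one example plus (lam / 2) * ||w||^2, with gradients."""
    pred = sum(wi * xi for wi, xi in zip(w, x)) + b
    err = pred - y
    loss = 0.5 * err ** 2 + 0.5 * lam * sum(wi ** 2 for wi in w)
    # The penalty adds lam * w_i to each weight gradient; the bias is not penalized.
    grad_w = [err * xi + lam * wi for wi, xi in zip(w, x)]
    grad_b = err
    return loss, grad_w, grad_b


w, b, x, y = [1.0, -2.0], 0.5, [3.0, 1.0], 2.0
plain = penalized_loss_and_grad(w, b, x, y, lam=0.0)      # original loss recovered
decayed = penalized_loss_and_grad(w, b, x, y, lam=0.1)    # weights pulled toward zero
```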

Regularization Constant

Feature scaling greatly affects which of the following supervised machine learning methods? 

Standardizing input vectors to have zero mean and unit variance not only places parameters on a similar a priori scale to aid optimizers, but it also constrains the complexity of the functions that act upon them. For instance, theoretical bounds such as the radius-margin bound in support vector machines and the Perceptron Convergence Theorem explicitly rely on the inputs having a bounded norm.

Feature Standardization and Function Complexity

Normalizing the outputs of layers in deep neural networks—by subtracting the mean and dividing by the standard deviation—helps to effectively mitigate the covariate shift problem. This reduction in covariate shift is a primary mechanism by which layer normalization improves overall training stability.

Reduction of Covariate Shift via Layer Normalization

Sampling bias during data collection can result in severe covariate shift, leading to models that fail in practice. For example, a medical algorithm designed to detect a disease using blood samples might be trained on a dataset consisting of sick older patients alongside healthy college students. Because the cohorts differ drastically in unrelated factors like age and hormone levels, the classifier might achieve high accuracy by learning these spurious features instead of genuine disease indicators. When deployed on real patients, the test will likely fail due to the extreme covariate shift between the training sample and the actual patient population.
