Deep Learning Weight Initialization
Every weight in the network must be given an initial value before the first forward pass. Two naive options are to set all weights to zero or to draw them from an unscaled random distribution. Zero initialization fails to break symmetry between neurons, and poorly scaled random initialization can produce vanishing or exploding gradients; either way, the model becomes difficult to train. To mitigate this, you can use a heuristic that scales the random weights by the size of each layer (its number of inputs and outputs, or fan-in and fan-out). A common heuristic for the Tanh activation is Xavier initialization, which draws each layer's weights with variance 2 / (fan_in + fan_out).
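As a minimal sketch, here is Xavier (Glorot) uniform initialization for one layer in NumPy; the layer sizes 784 and 256 are illustrative, not from this note:

    import numpy as np

    def xavier_init(fan_in, fan_out):
        # Xavier/Glorot uniform: variance 2 / (fan_in + fan_out), which keeps
        # activation variance roughly constant across Tanh layers.
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

    W1 = xavier_init(784, 256)   # e.g., first layer of an MNIST-sized network
    print(W1.std())              # roughly sqrt(2 / (784 + 256)) ~= 0.044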
Alternatively, if the same architecture has already been trained by someone else (for example, on ImageNet), the weights can be initialized to those pretrained values. This is known as transfer learning, and it can substantially speed up training and improve model accuracy, as sketched below.
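A minimal sketch of this idea, assuming PyTorch and torchvision; the ResNet-18 backbone, the weight identifier, the frozen backbone, and the 10-class head are illustrative assumptions, not part of this note:

    import torch.nn as nn
    import torchvision.models as models

    # Start from ImageNet-pretrained weights instead of random initialization.
    model = models.resnet18(weights="IMAGENET1K_V1")

    # Optionally freeze the pretrained layers so only the new head is trained.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the classifier head for the new task (10 classes is hypothetical).
    model.fc = nn.Linear(model.fc.in_features, 10)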
Tags
Data Science
D2L
Dive into Deep Learning @ D2L
Related
Forward Propagation
Update Weight Iteratively Until Convergence
Deep Learning Weight Initialization
What is the "cache" used for in our implementation of forward propagation and backward propagation?
Consider the following 1 hidden layer neural network:
Which of the following are true regarding activation outputs and vectors? (Check all that apply.)
Backpropagation
Objective Function
Depth and Width for Neural Networks
Dropout
Neural Network Learning Rate
Epochs in Machine Learning
Activation Functions in Neural Networks
Deep Learning Optimizer Algorithms
Batch Normalization in Deep Learning
Hyperparameters Tuning Methods in Deep Learning
Difference between Model Parameter and Model Hyperparameter
Regularization Constant
Learn After
Example of Weight Initialization
Vanishing/exploding gradient
Symmetry Breaking in Deep Learning
How to Initialize Weights to Prevent Vanishing/Exploding Gradients
Transfer Learning in Deep Learning
Multi-task Learning in Deep Learning
Variance of Layer Output in Forward Propagation
Default Random Initialization
Xavier Initialization