In a Long Short-Term Memory (LSTM) network, the learnable model parameters include weight matrices and bias vectors for the three gates (input, forget, and output), as well as the input node. The dimensions of these parameters depend on the input size and the chosen number of hidden units. A standard initialization strategy involves drawing all weight values from a Gaussian distribution with a small standard deviation (e.g., $$ 0.01 $$), and initializing all bias values exactly to $$ 0 $$.

LSTM Parameters Initialization
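As a rough illustration of the parameter shapes described above, the following sketch builds the weights and biases for the three gates and the input node. It assumes PyTorch; the helper name `init_lstm_params` and the sizes 28 and 32 are hypothetical choices made only for this example.

```python
import torch
from torch import nn

def init_lstm_params(num_inputs, num_hiddens, sigma=0.01):
    """Create LSTM parameters: Gaussian weights (std sigma) and zero biases."""
    def init_weight(*shape):
        return nn.Parameter(torch.randn(*shape) * sigma)

    def triple():
        # (input-to-hidden weights, hidden-to-hidden weights, bias)
        return (init_weight(num_inputs, num_hiddens),
                init_weight(num_hiddens, num_hiddens),
                nn.Parameter(torch.zeros(num_hiddens)))

    names = ("input_gate", "forget_gate", "output_gate", "input_node")
    return {name: triple() for name in names}

params = init_lstm_params(num_inputs=28, num_hiddens=32)
W_xi, W_hi, b_i = params["input_gate"]
print(W_xi.shape, W_hi.shape, b_i.shape)
```

Each of the four components gets its own triple, so the shapes depend only on the input size and the number of hidden units.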

Deep learning frameworks provide built-in initializers to establish the starting values of model parameters programmatically. A common baseline approach for neural network layers is to initialize all weight parameters as Gaussian random variables with a mean of $$0$$ and a specific standard deviation, such as $$0.01$$, while concurrently clearing all bias parameters to exactly $$0$$.
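A minimal sketch of this baseline, assuming PyTorch and its `torch.nn.init` module. The network layout and the helper name `init_normal` are illustrative assumptions, not part of the original text.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

def init_normal(module):
    # Only fully connected layers carry parameters here; ReLU has none.
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.zeros_(module.bias)

# apply() visits every submodule and calls init_normal on each one.
net.apply(init_normal)
print(net[0].weight.data[0], net[0].bias.data[0])
```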

Claude

Initial weights are applied to all the neurons.
Initial weights must be set before the first forward pass. Two basic options are to set all weights to zero or to randomize them. However, zero initialization leaves the neurons indistinguishable from one another, and poorly scaled random values can result in vanishing or exploding gradients, both of which make the model difficult to train. To mitigate this problem, you can use a heuristic (a formula tied to the number of inputs and outputs of each layer) to choose the scale of the weights. A common heuristic used with the Tanh activation is called Xavier initialization.

In addition, if the architecture being used has already been trained by someone else (for example, on the ImageNet dataset), the weights can be initialized to those trained values. This is known as transfer learning, and it can substantially increase training speed and model accuracy.

Deep Learning Weight Initialization

Dive into Deep Learning

Consider a single-neuron example: the goal is to train a deep network without the weights exploding or vanishing.

Example of Weight Initialization 

In a neural network with many time steps or layers, the gradient at an early layer is the product of terms from all the later layers, which makes it inherently unstable. When the gradient becomes very small, the early weights are barely updated, and the gradient is said to vanish. An exploding gradient is the opposite case: the weight updates computed by gradient descent become so large that the whole network becomes unstable, which can lead to numerical overflow.

Vanishing/exploding gradient
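The effect of repeated multiplication can be sketched with plain Python. The example below is a deliberate simplification: it assumes every layer scales the gradient by the same hypothetical factor, which real networks do not do exactly.

```python
def gradient_scale(per_layer_factor, num_layers):
    """Magnitude of a gradient after num_layers layers that each
    multiply it by per_layer_factor."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale

vanishing = gradient_scale(0.5, 50)   # each layer halves the gradient
exploding = gradient_scale(1.5, 50)   # each layer grows it by 50%
stable = gradient_scale(1.0, 50)      # each layer preserves it
print(f"{vanishing:.3e}  {exploding:.3e}  {stable:.1f}")
```

Even modest per-layer factors below or above 1 compound into extremely small or large values after 50 layers.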

Symmetry breaking refers to a requirement of initializing machine learning models like neural networks.

When all the weights of a neural network model are initialized to the same value, it can be difficult or impossible for the weights to differ as the model is trained. This is known as the “symmetry” problem.

Initializing the model to small random values breaks the symmetry and allows different weights to learn independently of each other.

Symmetry Breaking in Deep Learning
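The symmetry problem can be reproduced with a tiny hand-written network. This is a sketch under stated assumptions: two inputs, two tanh hidden units, one linear output, squared-error loss, and a single made-up training example. All names and values are hypothetical.

```python
import math
import random

def train_step(W, v, x, y, lr=0.05):
    """One gradient-descent step for a 2-input, 2-hidden-unit tanh network."""
    h = [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in W]
    err = sum(vk * hk for vk, hk in zip(v, h)) - y   # d(loss)/d(output)
    new_W = [[w - lr * err * vk * (1 - hk ** 2) * xj for w, xj in zip(row, x)]
             for row, vk, hk in zip(W, v, h)]
    new_v = [vk - lr * err * hk for vk, hk in zip(v, h)]
    return new_W, new_v

def train(W, v, steps=100):
    x, y = [1.0, -2.0], 0.5   # a single illustrative training example
    for _ in range(steps):
        W, v = train_step(W, v, x, y)
    return W, v

# Constant initialization: both hidden units start out identical.
W_const, v_const = train([[0.3, 0.3], [0.3, 0.3]], [0.3, 0.3])

# Small random initialization breaks the symmetry.
rng = random.Random(0)
W_rand, v_rand = train(
    [[rng.gauss(0, 0.01) for _ in range(2)] for _ in range(2)],
    [rng.gauss(0, 0.01) for _ in range(2)],
)

print("constant init, hidden units identical:", W_const[0] == W_const[1])
print("random init, hidden units identical:  ", W_rand[0] == W_rand[1])
```

With constant initialization both hidden units receive the same gradient at every step, so they remain copies of each other no matter how long training runs.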


The longest part of developing a deep learning model is typically the training stage. However, if the problem being tackled has already been studied by someone else, and the model has been released to the public together with its trained weights, then transfer learning can substantially speed up the training process.

Transfer Learning in Deep Learning
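One common way to reuse released weights can be sketched as follows, assuming PyTorch. The `pretrained` model here is only a randomly initialized stand-in for a published checkpoint, and the layer sizes and the helper `make_backbone` are hypothetical.

```python
import torch
from torch import nn

def make_backbone():
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                         nn.Linear(32, 32), nn.ReLU())

# Stand-in for a published model; in practice its weights would be
# loaded from a released checkpoint rather than created here.
pretrained = nn.Sequential(make_backbone(), nn.Linear(32, 10))

# New model for a different 3-class task: reuse the backbone weights
# and attach a freshly initialized output layer.
backbone = make_backbone()
backbone.load_state_dict(pretrained[0].state_dict())
for p in backbone.parameters():
    p.requires_grad = False   # freeze the transferred weights

model = nn.Sequential(backbone, nn.Linear(32, 3))
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Freezing the backbone is one design choice among several; fine-tuning all layers with a small learning rate is another common option.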

In multi-task learning, the goal is to try to have one neural network classify/predict multiple outputs at the same time, with each of the tasks helping in the execution of the other tasks, i.e., learning from multiple tasks. This contrasts with transfer learning where the tasks are done sequentially, with the previous task helping in the execution of the next.

Multi-task Learning in Deep Learning
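A shared trunk with one output head per task is one simple way to realize this idea. The sketch below assumes PyTorch; the class name `MultiTaskNet`, the two tasks (a 5-class classification and a scalar regression), and the random data are all hypothetical.

```python
import torch
from torch import nn

class MultiTaskNet(nn.Module):
    def __init__(self, num_inputs=16, num_hiddens=32, num_classes=5):
        super().__init__()
        # Layers shared by both tasks.
        self.trunk = nn.Sequential(nn.Linear(num_inputs, num_hiddens), nn.ReLU())
        # One head per task.
        self.class_head = nn.Linear(num_hiddens, num_classes)
        self.value_head = nn.Linear(num_hiddens, 1)

    def forward(self, x):
        shared = self.trunk(x)
        return self.class_head(shared), self.value_head(shared)

net = MultiTaskNet()
x = torch.randn(8, 16)
labels = torch.randint(0, 5, (8,))
targets = torch.randn(8, 1)

logits, values = net(x)
# Summing the per-task losses lets both tasks update the shared trunk.
loss = (nn.functional.cross_entropy(logits, labels)
        + nn.functional.mse_loss(values, targets))
loss.backward()
print(logits.shape, values.shape)
```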

When analyzing the scale distribution of an output $$o_i$$ for a fully connected layer without nonlinearities, the output is computed as $$o_i = \sum_{j=1}^{n_\textrm{in}} w_{ij} x_j$$. Assuming the inputs $$x_j$$ and weights $$w_{ij}$$ are drawn independently with a mean of $$0$$ and variances of $$\gamma^2$$ and $$\sigma^2$$ respectively, the expected value $$E[o_i]$$ is $$0$$. We can compute the variance as $$\textrm{Var}[o_i] = E[o_i^2] - (E[o_i])^2 = \sum_{j=1}^{n_\textrm{in}} E[w^2_{ij} x^2_j] - 0 = \sum_{j=1}^{n_\textrm{in}} E[w^2_{ij}] E[x^2_j] = n_\textrm{in} \sigma^2 \gamma^2$$. Note that the distribution does not have to be Gaussian, but the mean and variance must exist. To keep this variance fixed during forward propagation and prevent it from changing across layers, the initialization must satisfy the condition $$n_\textrm{in} \sigma^2 = 1$$.

Variance of Layer Output in Forward Propagation
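The variance formula can be checked numerically with a small Monte Carlo simulation. This sketch uses only the Python standard library; the values of `n_in`, `gamma`, and the sample count are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)
n_in = 100
gamma = 2.0                    # standard deviation of the inputs x_j
sigma = (1.0 / n_in) ** 0.5    # chosen so that n_in * sigma^2 = 1

def sample_output():
    # o_i = sum_j w_ij * x_j with independent zero-mean weights and inputs
    return sum(rng.gauss(0, sigma) * rng.gauss(0, gamma) for _ in range(n_in))

outputs = [sample_output() for _ in range(4000)]
empirical_var = statistics.pvariance(outputs)
predicted_var = n_in * sigma ** 2 * gamma ** 2
print(f"empirical {empirical_var:.2f}, predicted {predicted_var:.2f}")
```

Because $$n_\textrm{in} \sigma^2 = 1$$ here, the output variance should stay close to the input variance $$\gamma^2$$.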

When building a neural network, if the user does not explicitly specify a parameter initialization method, the deep learning framework applies a default random initialization method. The exact default varies by framework and layer type, but it typically draws weight values from a zero-mean random distribution with a small scale. This default is often sufficient and works well in practice for moderate problem sizes.

Default Random Initialization
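To see what a default initialization looks like in one framework, the sketch below creates a PyTorch `nn.Linear` layer without specifying any initializer and inspects its weights. The exact default scheme is an implementation detail that may differ across versions and frameworks, so the printed statistics are only indicative.

```python
import torch
from torch import nn

layer = nn.Linear(256, 64)   # no initializer specified
w = layer.weight.data
print(f"mean {w.mean().item():.4f}, std {w.std().item():.4f}, "
      f"min {w.min().item():.4f}, max {w.max().item():.4f}")
```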

Xavier initialization, named after Xavier Glorot, the first author of Glorot and Bengio (2010), is a standard technique designed to mitigate vanishing and exploding gradients by carefully setting the initial weights of a neural network layer. To balance the variance during both forward and backward propagation, it typically samples weights from a Gaussian distribution with a mean of $$0$$ and a variance of $$\sigma^2 = \frac{2}{n_\textrm{in} + n_\textrm{out}}$$, where $$n_\textrm{in}$$ and $$n_\textrm{out}$$ represent the number of inputs and outputs of the layer respectively. While the underlying assumption of linear activations is often violated in practice, this initialization method has proven highly effective.

Xavier Initialization
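The Gaussian variant of this rule can be sketched with the Python standard library alone. The function name `xavier_normal` and the layer sizes are hypothetical; the only thing taken from the text is the variance $$\frac{2}{n_\textrm{in} + n_\textrm{out}}$$.

```python
import math
import random
import statistics

def xavier_normal(n_in, n_out, rng):
    """Sample an n_out x n_in weight matrix with variance 2 / (n_in + n_out)."""
    sigma = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, sigma) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
n_in, n_out = 300, 100
W = xavier_normal(n_in, n_out, rng)

flat = [w for row in W for w in row]
target_var = 2.0 / (n_in + n_out)
empirical_var = statistics.pvariance(flat)
print(f"target {target_var:.5f}, empirical {empirical_var:.5f}")
```

In PyTorch, `torch.nn.init.xavier_normal_` and `torch.nn.init.xavier_uniform_` provide this scheme for existing tensors.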

Built-in Gaussian Parameter Initialization

Beyond random distributions, deep learning frameworks provide utilities to initialize all parameters of a neural network or a specific layer to a given constant numerical value, such as $$1$$. While initializing weights to a constant is typically avoided because it fails to break symmetry, constant initialization can be programmatically applied when specific deterministic starting values are required.

Constant Parameter Initialization
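A minimal sketch of constant initialization, assuming PyTorch. The network and the helper name `init_constant` are illustrative; `nn.init.constant_` and `nn.init.zeros_` are the framework utilities doing the work.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

def init_constant(module):
    if isinstance(module, nn.Linear):
        nn.init.constant_(module.weight, 1.0)   # every weight becomes 1
        nn.init.zeros_(module.bias)             # every bias becomes 0

net.apply(init_constant)
print(net[0].weight.data[0], net[0].bias.data[0])
```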

Neural network parameters do not need to be initialized uniformly across an entire model. Deep learning frameworks allow practitioners to apply distinct initialization methods to specific architectural blocks or layers. For instance, one layer might use the Xavier initializer to maintain activation variance, while another layer in the same network could have its parameters initialized to a specific constant value.

Block-Specific Parameter Initialization
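The sketch below applies two different initializers to two layers of the same network, assuming PyTorch. The constant 42 and the helper names `init_xavier` and `init_42` are arbitrary illustrative choices.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

def init_xavier(module):
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)

def init_42(module):
    if isinstance(module, nn.Linear):
        nn.init.constant_(module.weight, 42.0)

# Calling apply() on a single layer restricts the initializer to that layer.
net[0].apply(init_xavier)
net[2].apply(init_42)
print(net[0].weight.data[0])
print(net[2].weight.data)
```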

Deep learning frameworks provide mechanisms to programmatically override existing parameter values during the initialization phase. While attempting to initialize a network that has already been initialized might normally be ignored to prevent accidental overwriting, using specific functions or arguments (such as a forced reinitialization flag) ensures that parameters are freshly initialized, regardless of whether they previously contained values.

Forced Parameter Reinitialization

When standard initialization methods are insufficient, deep learning frameworks allow practitioners to define custom parameter initialization routines. This is achieved by creating a custom function or class that applies a desired mathematical distribution or logic to a given parameter tensor. Once defined, this custom initializer can be applied to the neural network to populate the weights according to the specified custom logic.

Custom Parameter Initialization
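As one possible custom routine, the sketch below draws weights uniformly from $$[-10, 10]$$ and then zeroes every entry whose magnitude is below $$5$$. It assumes PyTorch; the distribution itself and the function name `my_init` are arbitrary choices made only to show the mechanism.

```python
import torch
from torch import nn

def my_init(module):
    if isinstance(module, nn.Linear):
        nn.init.uniform_(module.weight, -10.0, 10.0)
        with torch.no_grad():
            # Keep only entries with magnitude of at least 5; zero the rest.
            mask = (module.weight.abs() >= 5).to(module.weight.dtype)
            module.weight.mul_(mask)

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
net.apply(my_init)
print(net[0].weight.data[:2])
```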

Beyond using predefined or custom initialization functions, deep learning frameworks offer the flexibility of setting parameter values directly. Practitioners can access the underlying tensor data of a model's weights and apply direct mutations, such as adding a constant to all elements or assigning a specific numerical value to an exact matrix index. This direct assignment provides granular control over individual parameter values after their initial creation.

Direct Parameter Assignment
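Direct assignment might look like the following in PyTorch. The values added and assigned are arbitrary, and wrapping the mutation in `torch.no_grad()` is one way to keep these edits out of the autograd graph.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
before = net[0].weight.detach().clone()

with torch.no_grad():
    w = net[0].weight
    w.add_(1.0)       # add a constant to every element
    w[0, 0] = 42.0    # overwrite one specific entry

print(net[0].weight.data[0])
```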

Lazy parameter initialization is a convenient deep learning technique where the framework defers creating model parameters until data is first passed through the network, at which point it automatically infers their shapes from the input. This dynamic shape inference makes it easier to modify network architectures and eliminates a common source of dimension mismatch errors during model construction.

Lazy Parameter Initialization
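In PyTorch, this behavior is available through lazy modules such as `nn.LazyLinear`, which take only the output size. The sketch below is illustrative; the layer sizes and the input width of 20 are assumptions.

```python
import torch
from torch import nn

# Only output sizes are given; input sizes are left for the framework to infer.
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

# Before any data is seen, the weight has no concrete shape yet.
uninitialized = isinstance(net[0].weight, nn.parameter.UninitializedParameter)

X = torch.rand(2, 20)
net(X)   # the first forward pass fixes the parameter shapes
print(uninitialized, net[0].weight.shape, net[2].weight.shape)
```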

To prevent the gradients of a neural network's activations from vanishing or exploding, weight initialization strategies adhere to two fundamental rules: the mean of the activations should be zero, and their variance should remain constant across all layers. When these conditions hold, the backpropagated gradient signal avoids being multiplied by excessively small or large values. Consequently, maintaining a zero mean and constant variance helps keep the gradient signal stable throughout the network, at least at the start of training.
