In a Long Short-Term Memory (LSTM) network, the learnable model parameters include weight matrices and bias vectors for the three gates (input, forget, and output), as well as the input node. The dimensions of these parameters depend on the input size and the chosen number of hidden units. A standard initialization strategy involves drawing all weight values from a Gaussian distribution with a small standard deviation (e.g., $$ 0.01 $$), and initializing all bias values exactly to $$ 0 $$.
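The parameter shapes and the initialization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `init_lstm_params`, the dictionary layout, and the split of each weight into an input-to-hidden part (`W_x`) and a hidden-to-hidden part (`W_h`) are assumptions made for the example.

```python
import numpy as np

def init_lstm_params(num_inputs, num_hiddens, sigma=0.01, seed=0):
    """Hypothetical helper: Gaussian weights (std = sigma), zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    # Three gates plus the input node, each with its own weights and bias.
    for name in ("input_gate", "forget_gate", "output_gate", "input_node"):
        params[name] = {
            # Input-to-hidden weights: shape depends on the input size.
            "W_x": rng.normal(0.0, sigma, size=(num_inputs, num_hiddens)),
            # Hidden-to-hidden weights: shape depends only on the hidden size.
            "W_h": rng.normal(0.0, sigma, size=(num_hiddens, num_hiddens)),
            # Biases start at exactly zero.
            "b": np.zeros(num_hiddens),
        }
    return params

params = init_lstm_params(num_inputs=8, num_hiddens=16)
total = sum(a.size for group in params.values() for a in group.values())
print(total)  # 4 * (8*16 + 16*16 + 16) = 1600
```

With 8 inputs and 16 hidden units, each of the four parameter groups holds 400 values, which shows how the parameter count grows with both the input size and the number of hidden units.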


LSTM units use gates to control the flow of information into and out of the units that make up the network layers.
- The Forget Gate deletes information from the context that is no longer needed.
- The Add Gate selects the information to add to the current context.
- The Output Gate decides what information is required for the current hidden state.

Gates in LSTMs

Deep learning frameworks provide built-in initializers that set the starting values of model parameters programmatically. A common baseline for neural network layers is to draw every weight from a Gaussian distribution with mean $$0$$ and a small standard deviation, such as $$0.01$$, and to set every bias to exactly $$0$$.

Built-in Gaussian Parameter Initialization

Dive into Deep Learning

LSTM Parameters Initialization

The forget gate is $$\Gamma_f=\sigma(W_f[a^{<t-1>}, x^{<t>}]+b_f)$$, where $$\sigma$$ denotes the sigmoid activation function, $$W_f$$ is the weight matrix, $$b_f$$ is a bias term, $$a^{<t-1>}$$ denotes the hidden state from the previous time step, and $$x^{<t>}$$ is the input at the $$t$$-th time step. The notation $$[a^{<t-1>}, x^{<t>}]$$ means that $$a^{<t-1>}$$ and $$x^{<t>}$$ are concatenated. Next, the cell state update is computed in two steps. First, the update gate is $$\Gamma_u=\sigma(W_u[a^{<t-1>}, x^{<t>}]+b_u)$$. Second, the candidate cell state is $$\tilde{c}^{<t>}=\tanh(W_c[a^{<t-1>}, x^{<t>}]+b_c)$$, where $$\tanh$$ denotes the hyperbolic tangent activation function. Combining these results, the current cell state is $$c^{<t>}=\Gamma_u*\tilde{c}^{<t>}+\Gamma_f*c^{<t-1>}$$, where $$*$$ denotes element-wise multiplication. Finally, the third gate, the output gate, is $$\Gamma_o=\sigma(W_o[a^{<t-1>}, x^{<t>}]+b_o)$$, and from the output gate and the current cell state, the current hidden state is $$a^{<t>}=\Gamma_o*\tanh(c^{<t>})$$.
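The equations above translate almost line for line into NumPy. The following is a minimal sketch of a single time step for one example (no batch dimension); the function name `lstm_step`, the parameter dictionary `p` with keys such as `W_f` and `b_f`, and the toy sizes are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, p):
    """One LSTM time step following the gate equations in the text."""
    concat = np.concatenate([a_prev, x_t])            # [a^{<t-1>}, x^{<t>}]
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])   # forget gate
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])   # update gate
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])   # candidate cell state
    c_t = gamma_u * c_tilde + gamma_f * c_prev        # element-wise products
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])   # output gate
    a_t = gamma_o * np.tanh(c_t)                      # new hidden state
    return a_t, c_t

# Toy sizes (assumed): 3 input features, 5 hidden units.
n_x, n_a = 3, 5
rng = np.random.default_rng(0)
p = {}
for g in "fuco":
    # Each weight matrix acts on the concatenated vector of length n_a + n_x.
    p[f"W_{g}"] = rng.normal(0.0, 0.01, size=(n_a, n_a + n_x))
    p[f"b_{g}"] = np.zeros(n_a)

a_t, c_t = lstm_step(rng.normal(size=n_x), np.zeros(n_a), np.zeros(n_a), p)
print(a_t.shape, c_t.shape)  # (5,) (5,)
```

Because every weight matrix multiplies the concatenation of the previous hidden state and the current input, each has shape `(n_a, n_a + n_x)`; to process a sequence, the returned `a_t` and `c_t` would be passed back in as `a_prev` and `c_prev` for the next time step.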
