As an example, a deep Gated Recurrent Unit (GRU) language model can be trained by specifying more than one hidden layer, for instance by setting the `num_layers` parameter to 2. The remaining architectural decisions and hyperparameters mirror those of a single-layer network: the number of inputs and outputs equals the number of distinct tokens (`vocab_size`), and the number of hidden units is set to a typical value (e.g., 32). The only structural difference is the explicit choice of multiple hidden layers.
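A minimal sketch of such a model, assuming PyTorch as the framework. The class name `DeepGRULM`, the toy vocabulary size of 28, and the random token batch are illustrative assumptions; only `num_layers=2` and the 32 hidden units come from the text above.

```python
import torch
from torch import nn


class DeepGRULM(nn.Module):
    """Toy language model built on a stacked GRU (hypothetical helper class)."""

    def __init__(self, vocab_size, num_hiddens=32, num_layers=2):
        super().__init__()
        self.vocab_size = vocab_size
        # Inputs are one-hot tokens, so the input size equals vocab_size.
        self.rnn = nn.GRU(vocab_size, num_hiddens, num_layers=num_layers)
        # Map each hidden state back to one logit per token.
        self.out = nn.Linear(num_hiddens, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, steps) integer ids -> (steps, batch, vocab_size) one-hot
        X = nn.functional.one_hot(tokens.T, self.vocab_size).float()
        H, state = self.rnn(X, state)
        return self.out(H), state


vocab_size = 28  # assumed toy vocabulary size
model = DeepGRULM(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Random stand-in data: 4 sequences of 10 tokens each.
tokens = torch.randint(0, vocab_size, (4, 10))
targets = torch.randint(0, vocab_size, (4, 10))

# One training step: forward pass, cross-entropy loss, gradient update.
logits, state = model(tokens)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.T.reshape(-1)
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The returned `state` has one slice per layer, which is the only visible difference from the single-layer case.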

To implement a multilayer Recurrent Neural Network (RNN) from scratch, the network can be constructed by instantiating each layer as an individual basic recurrent unit, such as an `RNNScratch` object, possessing its own independent set of learnable parameters. The initial layer processes the original sequence data, whereas every subsequent layer receives the hidden representations produced by the layer immediately preceding it.

Stacked RNN Implementation from Scratch

Gated Recurrent Units (GRUs) mitigate the vanishing gradient problem of simple RNNs while adding fewer parameters than LSTMs. They do so by dispensing with a separate context (memory cell) vector and by reducing the number of gates to two: a reset (or relevance) gate and an update gate.
The purpose of the reset gate is to decide which aspects of the previous hidden state are relevant to the current context and which can be ignored. The gated previous state is then combined with the current input to compute an intermediate (candidate) representation of the new hidden state.
The purpose of the update gate is to determine which aspects of this candidate state will be used directly in the new hidden state and which aspects of the previous state need to be preserved for future use.

Gated recurrent unit (GRU)

Dive into Deep Learning

The forward pass in a deep Recurrent Neural Network (RNN) implemented from scratch is computed iteratively across its layers. For each layer in the network stack, the sequence of inputs—which consists of either the raw input data or the processed outputs from the previous layer—is fed alongside that specific layer's current hidden state. The resultant outputs from all time steps are subsequently aggregated, for instance by stacking them along a new dimension, to be utilized as the input sequence for the succeeding layer or as the final model output.
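The construction and forward pass described above can be sketched in NumPy as follows. The `RNNScratch` class here is a minimal stand-in written for this example (a tanh layer with its own parameters), not the original implementation; `StackedRNNScratch`, the shapes, and the initialization scale are likewise assumptions.

```python
import numpy as np


class RNNScratch:
    """One tanh RNN layer with its own parameters (minimal stand-in)."""

    def __init__(self, num_inputs, num_hiddens, rng, sigma=0.01):
        self.num_hiddens = num_hiddens
        self.W_xh = rng.normal(0.0, sigma, (num_inputs, num_hiddens))
        self.W_hh = rng.normal(0.0, sigma, (num_hiddens, num_hiddens))
        self.b_h = np.zeros(num_hiddens)

    def __call__(self, inputs, state=None):
        # inputs: (num_steps, batch_size, num_inputs)
        if state is None:
            state = np.zeros((inputs.shape[1], self.num_hiddens))
        outputs = []
        for X in inputs:  # iterate over time steps
            state = np.tanh(X @ self.W_xh + state @ self.W_hh + self.b_h)
            outputs.append(state)
        return outputs, state


class StackedRNNScratch:
    """Stack of independent RNN layers (hypothetical name)."""

    def __init__(self, num_inputs, num_hiddens, num_layers, seed=0):
        rng = np.random.default_rng(seed)
        # The first layer reads the raw inputs; later layers read hidden states.
        self.layers = [
            RNNScratch(num_inputs if i == 0 else num_hiddens, num_hiddens, rng)
            for i in range(num_layers)
        ]

    def __call__(self, inputs, states=None):
        if states is None:
            states = [None] * len(self.layers)
        outputs, new_states = inputs, []
        for layer, state in zip(self.layers, states):
            outputs, state = layer(outputs, state)
            # Stack per-step outputs into (num_steps, batch_size, num_hiddens)
            # so they can serve as the next layer's input sequence.
            outputs = np.stack(outputs, axis=0)
            new_states.append(state)
        return outputs, new_states


X = np.ones((5, 3, 8))  # 5 time steps, batch of 3, 8 input features
model = StackedRNNScratch(num_inputs=8, num_hiddens=16, num_layers=2)
outputs, states = model(X)
```

The final `outputs` come from the top layer, while `states` keeps one final hidden state per layer so that each layer can be resumed independently.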

Stacked RNN Forward Computation from Scratch

Example of Training a Deep GRU Model

Gated Recurrent Units (GRUs) are defined by two key distinguishing features that govern how they manage sequences:

- Reset Gates: These help the model capture short-term dependencies in sequences by controlling how much of the previous state should be ignored when computing the new candidate state.
- Update Gates: These help the model capture long-term dependencies in sequences by deciding how much of the previous hidden state should be preserved in the final hidden state.

Notably, GRUs contain basic (simple) RNNs as an extreme case: when the reset gate is fully on (close to 1), the candidate state computation reduces to a standard RNN update, and if the update gate additionally passes the candidate through unchanged, the whole step is a plain RNN step. GRUs can also effectively skip subsequences: in the convention where the update gate weights the previous hidden state, an update gate close to 1 copies the hidden state from the previous time step with minimal modification.
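The two extreme cases can be checked numerically. The sketch below supplies the gate values by hand instead of computing them, and assumes the convention in which the update gate `Z` multiplies the previous hidden state; the function names and shapes are illustrative.

```python
import numpy as np


def gru_step_with_gates(X, H_prev, R, Z, W_xh, W_hh, b_h):
    """GRU state update with reset gate R and update gate Z given explicitly."""
    H_cand = np.tanh(X @ W_xh + (R * H_prev) @ W_hh + b_h)
    # Assumed convention: Z weights the previous state, (1 - Z) the candidate.
    return Z * H_prev + (1 - Z) * H_cand


def rnn_step(X, H_prev, W_xh, W_hh, b_h):
    """Plain tanh RNN update, for comparison."""
    return np.tanh(X @ W_xh + H_prev @ W_hh + b_h)


rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))        # batch of 2, 4 input features
H_prev = rng.normal(size=(2, 3))   # 3 hidden units
W_xh = rng.normal(size=(4, 3))
W_hh = rng.normal(size=(3, 3))
b_h = np.zeros(3)
ones, zeros = np.ones_like(H_prev), np.zeros_like(H_prev)

# Reset gate on (R = 1) and update gate off (Z = 0): a plain RNN step.
H_rnn_like = gru_step_with_gates(X, H_prev, ones, zeros, W_xh, W_hh, b_h)

# Update gate on (Z = 1): the previous state is copied and the input is skipped.
H_copied = gru_step_with_gates(X, H_prev, ones, ones, W_xh, W_hh, b_h)
```

In a trained GRU the gates take intermediate values per hidden unit, so each unit can sit anywhere between these two behaviors.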

Distinguishing Features of GRUs

In a Gated Recurrent Unit (GRU) network, the learnable model parameters encompass weight matrices and bias vectors for the update gate, the reset gate, and the candidate hidden state. The dimensionality of these parameters is dictated by the input size and the hyperparameter defining the number of hidden units. A standard initialization strategy involves drawing all weight values from a Gaussian distribution with a specified standard deviation, while initializing all bias values exactly to $$0$$.
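A sketch of this initialization in NumPy. The function name `init_gru_params`, the parameter naming scheme, the standard deviation of 0.01, and the example sizes are assumptions chosen for illustration.

```python
import numpy as np


def init_gru_params(num_inputs, num_hiddens, sigma=0.01, seed=0):
    """Gaussian weights and zero biases for the two gates and the candidate state."""
    rng = np.random.default_rng(seed)
    params = {}
    # "z": update gate, "r": reset gate, "h": candidate hidden state
    for name in ("z", "r", "h"):
        # Input-to-hidden weights depend on the input size ...
        params[f"W_x{name}"] = rng.normal(0.0, sigma, (num_inputs, num_hiddens))
        # ... hidden-to-hidden weights only on the number of hidden units.
        params[f"W_h{name}"] = rng.normal(0.0, sigma, (num_hiddens, num_hiddens))
        params[f"b_{name}"] = np.zeros(num_hiddens)
    return params


params = init_gru_params(num_inputs=28, num_hiddens=32)
num_params = sum(p.size for p in params.values())
```

Each of the three components gets the same trio of shapes, so the total parameter count is three times that of a simple RNN layer of the same size.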

GRU Parameters Initialization

Similar to vanilla Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, a Gated Recurrent Unit (GRU) model can be implemented concisely by directly instantiating high-level API modules in modern deep learning frameworks. In PyTorch, the built-in `nn.GRU` layer is used; in MXNet, `rnn.GRU`; in JAX/Flax, `nn.GRUCell` combined with `nn.scan` to process sequences; and in TensorFlow, `tf.keras.layers.GRU` with `return_sequences=True` and `return_state=True`. This approach encapsulates all low-level configuration details, such as explicitly defining the update and reset gates or manually initializing their weight matrices and biases. The resulting code runs significantly faster during training because it relies on optimized, compiled backend operators rather than executing gate computations in Python loops.

Concise GRU Implementation

When evaluated against Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRUs) achieve comparable performance on sequence modeling tasks but tend to be computationally lighter. Compared with simple (vanilla) RNNs, gated recurrent architectures—including both LSTMs and GRUs—are substantially better at capturing dependencies across sequences with large time step distances, owing to their gating mechanisms that regulate information flow through the hidden state.
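One way to make "computationally lighter" concrete is to count parameters per layer. The sketch below assumes each gate or candidate block has one weight matrix over the concatenated input and hidden state plus one bias vector; the helper name and the example sizes are illustrative, and some frameworks store two bias vectors per block, so their reported totals differ slightly.

```python
def recurrent_param_count(num_inputs, num_hiddens, num_blocks):
    """Parameters of one recurrent layer with `num_blocks` gate/candidate blocks."""
    per_block = num_hiddens * (num_inputs + num_hiddens) + num_hiddens
    return num_blocks * per_block


num_inputs, num_hiddens = 28, 32
counts = {
    "rnn": recurrent_param_count(num_inputs, num_hiddens, 1),   # hidden update only
    "gru": recurrent_param_count(num_inputs, num_hiddens, 3),   # 2 gates + candidate
    "lstm": recurrent_param_count(num_inputs, num_hiddens, 4),  # 3 gates + candidate
}
print(counts)  # {'rnn': 1952, 'gru': 5856, 'lstm': 7808}
```

Under these assumptions a GRU layer carries three quarters of the parameters of an LSTM layer of the same width.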

GRU vs LSTM Performance Comparison

The mathematical formulation of a Gated Recurrent Unit (GRU) at time step $$t$$ involves several key equations. The reset (or relevance) gate is $$\Gamma_r=\sigma(W_r[h^{<t-1>}, x^{<t>}]+b_r)$$, where $$\sigma$$ denotes the sigmoid activation function, $$W_r$$ is a parameter matrix, $$b_r$$ is a bias vector, and $$[h^{<t-1>}, x^{<t>}]$$ represents the concatenation of the previous hidden state $$h^{<t-1>}$$ and the current input $$x^{<t>}$$. The update gate is computed analogously as $$\Gamma_u=\sigma(W_u[h^{<t-1>}, x^{<t>}]+b_u)$$. The candidate hidden state is $$\tilde h^{<t>}=\tanh(W_h[\Gamma_r\odot h^{<t-1>}, x^{<t>}]+b_h)$$, where $$\odot$$ denotes elementwise multiplication. The current hidden state is then updated as $$h^{<t>}=(1-\Gamma_u)\odot h^{<t-1>}+\Gamma_u\odot\tilde h^{<t>}$$, and the current output is $$\hat y^{<t>}=g(W_y h^{<t>}+b_y)$$, where $$g$$ is an activation function. In this convention, $$\Gamma_u$$ close to 1 replaces the state with the candidate, while $$\Gamma_u$$ close to 0 preserves the previous state.
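The equations above translate almost line by line into NumPy. This is a sketch for a single unbatched time step; the dictionary layout of the parameters, the sizes, and the random inputs are assumptions, and the output layer is omitted because the activation $$g$$ is left unspecified.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, p):
    """One GRU time step using concatenated [h_prev, x_t] inputs."""
    concat = np.concatenate([h_prev, x_t])
    gamma_r = sigmoid(p["W_r"] @ concat + p["b_r"])   # reset (relevance) gate
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])   # update gate
    # The candidate state sees the previous state only through the reset gate.
    gated = np.concatenate([gamma_r * h_prev, x_t])
    h_cand = np.tanh(p["W_h"] @ gated + p["b_h"])
    # Elementwise interpolation between the previous state and the candidate.
    h_t = (1 - gamma_u) * h_prev + gamma_u * h_cand
    return h_t, gamma_r, gamma_u


rng = np.random.default_rng(0)
n_x, n_h = 4, 3  # assumed input and hidden sizes
p = {f"W_{k}": rng.normal(0.0, 0.1, (n_h, n_h + n_x)) for k in "ruh"}
p.update({f"b_{k}": np.zeros(n_h) for k in "ruh"})

h = np.zeros(n_h)
for x_t in rng.normal(size=(5, n_x)):  # run over a sequence of 5 inputs
    h, gamma_r, gamma_u = gru_step(x_t, h, p)
```

Because both gates are sigmoid outputs and the candidate is a tanh output, every hidden state stays strictly inside (-1, 1) when the initial state does.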
