Math behind GRUs

The reset (relevance) gate is \Gamma_r=\sigma(W_r[h^{<t-1>}, x^{<t>}]+b_r), where \sigma denotes the sigmoid activation function, W_r is a parameter matrix, b_r is a bias term, t indexes the time step, and [h^{<t-1>}, x^{<t>}] denotes the concatenation of the previous hidden state h^{<t-1>} and the current input x^{<t>}. The update gate is \Gamma_u=\sigma(W_u[h^{<t-1>}, x^{<t>}]+b_u). The intermediate hidden state candidate is \tilde h^{<t>}=\tanh(W_h[\Gamma_r\odot h^{<t-1>}, x^{<t>}]+b_h), where \odot denotes element-wise multiplication, so the reset gate decides how much of the previous state feeds into the candidate. The current hidden state is then h^{<t>}=(1-\Gamma_u)\odot h^{<t-1>}+\Gamma_u\odot\tilde h^{<t>}, and the current output is \hat y^{<t>}=g(W_y h^{<t>}+b_y), where g is an activation function.
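The following is a minimal NumPy sketch of a single GRU forward step that transcribes the equations above. The parameter names (W_r, b_r, etc.), the dimensions, and the choice of softmax for the output activation g are illustrative assumptions, not part of the original concept.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, params):
    """One GRU time step.

    h_prev : previous hidden state h^{<t-1>}, shape (n_h,)
    x_t    : current input x^{<t>}, shape (n_x,)
    params : dict with W_r, W_u, W_h of shape (n_h, n_h + n_x),
             b_r, b_u, b_h of shape (n_h,),
             W_y of shape (n_y, n_h), b_y of shape (n_y,)
    """
    concat = np.concatenate([h_prev, x_t])                      # [h^{<t-1>}, x^{<t>}]

    gamma_r = sigmoid(params["W_r"] @ concat + params["b_r"])   # reset/relevance gate
    gamma_u = sigmoid(params["W_u"] @ concat + params["b_u"])   # update gate

    # Candidate hidden state: the reset gate scales h^{<t-1>} element-wise
    # before it enters the candidate computation.
    concat_r = np.concatenate([gamma_r * h_prev, x_t])
    h_cand = np.tanh(params["W_h"] @ concat_r + params["b_h"])

    # Blend the previous state and the candidate with the update gate.
    h_t = (1.0 - gamma_u) * h_prev + gamma_u * h_cand

    # Output: softmax stands in for the generic activation g (an assumption).
    logits = params["W_y"] @ h_t + params["b_y"]
    y_hat = np.exp(logits - logits.max())
    y_hat /= y_hat.sum()
    return h_t, y_hat

A quick usage example with random parameters (sizes are arbitrary):

rng = np.random.default_rng(0)
n_h, n_x, n_y = 4, 3, 2
params = {
    "W_r": rng.normal(size=(n_h, n_h + n_x)), "b_r": np.zeros(n_h),
    "W_u": rng.normal(size=(n_h, n_h + n_x)), "b_u": np.zeros(n_h),
    "W_h": rng.normal(size=(n_h, n_h + n_x)), "b_h": np.zeros(n_h),
    "W_y": rng.normal(size=(n_y, n_h)), "b_y": np.zeros(n_y),
}
h_t, y_hat = gru_step(np.zeros(n_h), rng.normal(size=n_x), params)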



Tags

Data Science
