For a minibatch of sequence inputs with a batch size of 4 and 9 time steps, a two-layer GRU encoder with 16 hidden units produces two tensors:

1. The `enc_outputs` tensor of shape `(num_steps, batch_size, num_hiddens) = (9, 4, 16)`, representing the top-layer hidden states at every time step.
2. The `enc_state` tensor of shape `(num_layers, batch_size, num_hiddens) = (2, 4, 16)`, containing the multilayer hidden states at the final time step only.

Since GRUs use a single hidden state vector (unlike LSTMs, which also maintain a separate memory cell), the state tensor has exactly three dimensions.
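The difference in state layout can be checked directly. The following is a minimal sketch using PyTorch's `nn.GRU` and `nn.LSTM`; the input feature size of 8 is an arbitrary assumption, since the text above does not specify it.

```python
import torch
from torch import nn

# Sizes from the text; embed_size = 8 is an assumed input feature size.
num_steps, batch_size, embed_size, num_hiddens, num_layers = 9, 4, 8, 16, 2
X = torch.zeros(num_steps, batch_size, embed_size)  # (time, batch, features)

# A GRU returns its state as a single tensor.
gru = nn.GRU(embed_size, num_hiddens, num_layers)
gru_outputs, gru_state = gru(X)

# An LSTM returns its state as a pair: (hidden state, memory cell).
lstm = nn.LSTM(embed_size, num_hiddens, num_layers)
lstm_outputs, (hidden_state, memory_cell) = lstm(X)

print(gru_outputs.shape)   # torch.Size([9, 4, 16])
print(gru_state.shape)     # torch.Size([2, 4, 16])
print(hidden_state.shape)  # torch.Size([2, 4, 16])
print(memory_cell.shape)   # torch.Size([2, 4, 16])
```

With an LSTM, the encoder state would therefore be a tuple of two three-dimensional tensors rather than one.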

The `Seq2SeqEncoder` class implements the RNN-based encoder for sequence-to-sequence learning by extending a base `Encoder` interface. It has two components: an **embedding layer** that maps each input token index to a dense feature vector, and a **multilayer GRU** that processes the resulting sequence of embeddings. The embedding layer's weight matrix has shape `(vocab_size, embed_size)`, and row $$i$$ stores the feature vector for the token with index $$i$$. During the forward pass, the input tensor of shape `(batch_size, num_steps)` is first transposed to `(num_steps, batch_size)` and then embedded, producing a tensor of shape `(num_steps, batch_size, embed_size)`. The GRU processes this sequence and returns two values: `outputs` of shape `(num_steps, batch_size, num_hiddens)`, containing the top-layer hidden states at every time step, and `state` of shape `(num_layers, batch_size, num_hiddens)`, containing the hidden states of all layers at the final time step. All weights are initialized using Xavier initialization.
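The description above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not necessarily the original implementation: the `Encoder` base class here is a hypothetical stand-in, the `dropout` argument and the sizes `vocab_size = 10` and `embed_size = 8` are assumptions, and applying Xavier initialization to every parameter with more than one dimension is one reading of "all weights".

```python
import torch
from torch import nn


class Encoder(nn.Module):
    """Hypothetical stand-in for the base encoder interface."""

    def forward(self, X, *args):
        raise NotImplementedError


class Seq2SeqEncoder(Encoder):
    """Embedding layer followed by a multilayer GRU."""

    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0.0):
        super().__init__()
        # Weight matrix of shape (vocab_size, embed_size); row i is token i.
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, num_hiddens, num_layers,
                          dropout=dropout)
        # Assumption: "all weights" means every matrix-shaped parameter.
        # Biases (one-dimensional) keep their default initialization.
        for param in self.parameters():
            if param.dim() > 1:
                nn.init.xavier_uniform_(param)

    def forward(self, X, *args):
        # X: (batch_size, num_steps) -> transpose -> (num_steps, batch_size)
        embs = self.embedding(X.t().long())
        # embs: (num_steps, batch_size, embed_size)
        outputs, state = self.rnn(embs)
        # outputs: (num_steps, batch_size, num_hiddens)
        # state:   (num_layers, batch_size, num_hiddens)
        return outputs, state


# Assumed sizes for vocab_size and embed_size; the rest follow the text.
vocab_size, embed_size, num_hiddens, num_layers = 10, 8, 16, 2
batch_size, num_steps = 4, 9

encoder = Seq2SeqEncoder(vocab_size, embed_size, num_hiddens, num_layers)
X = torch.zeros((batch_size, num_steps), dtype=torch.long)
enc_outputs, enc_state = encoder(X)

print(enc_outputs.shape)  # torch.Size([9, 4, 16])
print(enc_state.shape)    # torch.Size([2, 4, 16])
```

The GRU is built with PyTorch's default `batch_first=False`, which is why the input is transposed so that time steps come first.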
