The decoder state in the Seq2SeqAttentionDecoder is initialized from the encoder outputs as a three-element tuple: (i) the encoder's last-layer hidden states at all time steps, transposed to shape (batch_size, num_steps, num_hiddens), which serve as both the keys and values for the attention mechanism; (ii) the encoder's hidden states across all layers at the final time step, with shape (num_layers, batch_size, num_hiddens), used to initialize the decoder's GRU hidden state; and (iii) the valid lengths of the encoder inputs, used to mask padding tokens during attention pooling.
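
The three-element state can be sketched with plain nested lists standing in for tensors. This is a minimal illustration, not the library code: the function name `init_state` mirrors the method described above, while the time-major layout of the raw encoder outputs and the toy sizes are assumptions made for the example.

```python
def init_state(enc_outputs, enc_hidden_state, enc_valid_lens):
    """Build the decoder state tuple from the encoder results.

    enc_outputs:      nested list shaped [num_steps][batch_size][num_hiddens]
                      (time-major layout is an assumption of this sketch)
    enc_hidden_state: nested list shaped [num_layers][batch_size][num_hiddens]
    enc_valid_lens:   list with one valid length per example
    """
    # Swap the first two axes: [num_steps][batch_size] -> [batch_size][num_steps].
    keys_values = [list(per_example) for per_example in zip(*enc_outputs)]
    return keys_values, enc_hidden_state, enc_valid_lens


# Toy sizes chosen only for illustration.
num_steps, batch_size, num_hiddens, num_layers = 3, 2, 4, 2
enc_outputs = [[[float(t * 10 + b)] * num_hiddens for b in range(batch_size)]
               for t in range(num_steps)]
enc_hidden = [[[0.0] * num_hiddens for _ in range(batch_size)]
              for _ in range(num_layers)]
valid_lens = [3, 2]

keys_values, hidden, lens = init_state(enc_outputs, enc_hidden, valid_lens)
# Sizes of the three axes of the keys/values after the transpose.
print(len(keys_values), len(keys_values[0]), len(keys_values[0][0]))  # 2 3 4
```

The first element is indexed by example and then by time step, which is the layout the attention mechanism expects for its keys and values.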

Seq2SeqAttentionDecoder State Initialization

During the forward pass of the Seq2SeqAttentionDecoder, target token indices are first embedded and transposed to shape (num_steps, batch_size, embed_size). The decoder then iterates over each time step. At each step: (1) the final-layer hidden state from the previous time step is unsqueezed to shape (batch_size, 1, num_hiddens) to act as the attention query; (2) the additive attention mechanism computes a context vector of shape (batch_size, 1, num_hiddens) by attending over all encoder outputs (keys and values), using valid lengths to mask padding; (3) the current embedded input is unsqueezed to shape (batch_size, 1, embed_size) and concatenated with the context vector along the feature dimension; (4) the concatenated tensor of shape (batch_size, 1, embed_size + num_hiddens) is transposed and fed into the GRU, which updates the decoder's hidden state. After processing all time steps, the GRU outputs are concatenated and projected through a dense layer to produce predictions of shape (batch_size, num_steps, vocab_size). Attention weights are stored for visualization.
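
The shape bookkeeping in the steps above can be checked with a small helper. This is only a sketch that tracks shapes, not a working decoder: the function name `trace_decoder_shapes` and the parameter `enc_steps` (the number of encoder time steps) are hypothetical names introduced for the example.

```python
def trace_decoder_shapes(batch_size, num_steps, enc_steps,
                         embed_size, num_hiddens, vocab_size):
    """Return the tensor shapes seen during one decoding step and at the output."""
    shapes = {
        # Embedded targets after the transpose to time-major order.
        "embedded_targets": (num_steps, batch_size, embed_size),
        # (1) Final-layer hidden state with a length-1 axis inserted.
        "query": (batch_size, 1, num_hiddens),
        # Encoder outputs act as both keys and values.
        "keys_values": (batch_size, enc_steps, num_hiddens),
        # One weight per encoder position for the single query.
        "attention_weights": (batch_size, 1, enc_steps),
        # (2) Weighted sum of the values.
        "context": (batch_size, 1, num_hiddens),
        # (3) Current embedded input with a length-1 axis inserted.
        "step_input": (batch_size, 1, embed_size),
    }
    # (3) Concatenate context and embedding along the feature (last) axis.
    concat = (batch_size, 1, embed_size + num_hiddens)
    shapes["concatenated"] = concat
    # (4) Swap the batch and time axes before feeding the GRU.
    shapes["gru_input"] = (concat[1], concat[0], concat[2])
    shapes["gru_step_output"] = (1, batch_size, num_hiddens)
    # Per-step outputs are joined along time, projected, and made batch-major.
    shapes["predictions"] = (batch_size, num_steps, vocab_size)
    return shapes


shapes = trace_decoder_shapes(batch_size=4, num_steps=7, enc_steps=9,
                              embed_size=8, num_hiddens=16, vocab_size=10)
for name, shape in shapes.items():
    print(f"{name:18s} {shape}")
```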

Seq2SeqAttentionDecoder Forward Pass

The Seq2SeqAttentionDecoder class implements a concrete RNN decoder that integrates Bahdanau-style additive attention into the sequence-to-sequence framework. Its architecture consists of four components: an AdditiveAttention module, an embedding layer that maps target token indices to dense vectors, a multilayer GRU whose input size equals embed_size + num_hiddens (to accommodate the concatenated context and embedding), and a fully connected output layer that projects hidden states to the target vocabulary size. All parameters are initialized using Xavier initialization.
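
To make the sizes of the four components concrete, the following sketch computes them from the constructor arguments. The helper `decoder_layer_sizes` is hypothetical and builds no layers. It assumes the usual stacked-RNN arrangement in which only the first GRU layer sees the concatenated input and every later layer receives `num_hiddens` features.

```python
def decoder_layer_sizes(vocab_size, embed_size, num_hiddens, num_layers):
    """Summarize the input/output sizes of the decoder's four components."""
    # Only the first GRU layer consumes the concatenated [context; embedding].
    gru_input_sizes = [embed_size + num_hiddens] + [num_hiddens] * (num_layers - 1)
    return {
        # Queries (decoder hidden state) and keys (encoder outputs) are both
        # num_hiddens wide.
        "attention": {"query_size": num_hiddens, "key_size": num_hiddens},
        # Lookup table: one embed_size vector per target token.
        "embedding": (vocab_size, embed_size),
        "gru_input_sizes": gru_input_sizes,
        "gru_hidden_size": num_hiddens,
        # Output layer maps hidden states to vocabulary logits.
        "dense": (num_hiddens, vocab_size),
    }


sizes = decoder_layer_sizes(vocab_size=10, embed_size=8,
                            num_hiddens=16, num_layers=2)
print(sizes["gru_input_sizes"])  # [24, 16]
```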


The AttentionDecoder class establishes the abstract base interface for all attention-based decoders by extending the standard Decoder base class. It inherits the init_state and forward-pass methods from its parent while introducing an additional contract: an attention_weights property. Subclasses must override this property to return the attention weight distributions computed during decoding. This added requirement ensures that any concrete attention-based decoder provides access to its internal attention weights, which is essential for interpretability and visualization of where the model focuses in the source sequence at each decoding step.
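
The contract can be sketched in plain Python. The `Decoder` class below is a hypothetical minimal stand-in for the standard base class, and `ToyAttentionDecoder` is an invented subclass that returns uniform weights purely to show how the property is overridden.

```python
class Decoder:
    """Hypothetical minimal stand-in for the standard decoder base class."""

    def init_state(self, enc_all_outputs, *args):
        raise NotImplementedError

    def forward(self, X, state):
        raise NotImplementedError


class AttentionDecoder(Decoder):
    """Base interface for attention-based decoders."""

    @property
    def attention_weights(self):
        # Concrete subclasses must override this property.
        raise NotImplementedError


class ToyAttentionDecoder(AttentionDecoder):
    """Invented example subclass: attends uniformly over the source."""

    def __init__(self):
        self._attention_weights = []

    def init_state(self, enc_all_outputs, *args):
        return enc_all_outputs

    def forward(self, X, state):
        # One row of uniform weights over source positions per target step.
        n = len(state)
        self._attention_weights = [[1.0 / n] * n for _ in X]
        return X, state

    @property
    def attention_weights(self):
        return self._attention_weights


decoder = ToyAttentionDecoder()
state = decoder.init_state(["h1", "h2", "h3", "h4"])
decoder.forward(["y1", "y2"], state)
print(decoder.attention_weights)  # [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
```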

AttentionDecoder Base Interface

In an attention-based decoder, the RNN cell itself remains unchanged, but the encoder's hidden states inform word generation at each decoder time step. The word produced at each step is a function of every encoder hidden state together with the current decoder state. Because input sequences vary in length, the number of encoder hidden states differs across examples, so the output function cannot take them directly as a fixed-size input. This is resolved by assigning each encoder hidden state a weight based on its relevance to the current decoder state, then summing the weighted states into a single fixed-size vector. A score function evaluates the compatibility between each encoder hidden state and the decoder state:

Encoder states: $$h_{1}, h_{2}, h_{3}$$

Decoder current state: $$h'_{t}$$

Scores: $$\text{score}(h_{1}, h'_{t}),\; \text{score}(h_{2}, h'_{t}),\; \text{score}(h_{3}, h'_{t})$$

Applying softmax yields normalized weights: $$s_{1}, s_{2}, s_{3}$$

The context vector is computed as: $$c_{t} = s_{1} \cdot h_{1} + s_{2} \cdot h_{2} + s_{3} \cdot h_{3}$$

Each encoder state is scored against the decoder state, the scores are normalized via softmax, and the resulting weighted sum forms a context vector. This context vector, combined with the decoder hidden state, determines the current output word. In the Bahdanau attention variant specifically, the decoder hidden state at the previous time step serves as the query, and the encoder hidden states at all time steps serve as both the keys and the values.
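
The score, softmax, and weighted-sum steps can be worked through in a few lines of standard-library Python. This is a sketch under one loud assumption: the score function here is a plain dot product, chosen only to keep the example short, since the text above does not fix a particular score function. The vectors are made-up toy values.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Assumed score function for this sketch: a plain dot product."""
    return sum(a * b for a, b in zip(u, v))


def context_vector(encoder_states, decoder_state, score=dot):
    """Return (weights, context) for one decoder state."""
    scores = [score(h, decoder_state) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # c_t = sum_i s_i * h_i, computed one feature at a time.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


h1, h2, h3 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
decoder_state = [1.0, 0.0]
weights, context = context_vector([h1, h2, h3], decoder_state)
print(weights)
print(context)

# The context vector keeps the same size however many encoder states there are.
_, longer_context = context_vector([h1, h2, h3, h1, h2], decoder_state)
print(len(context), len(longer_context))  # 2 2
```

Because the weighted sum collapses the variable number of encoder states into one vector, the output function always receives an input of the same dimension.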

Attention Decoder

Dive into Deep Learning

Seq2SeqAttentionDecoder Implementation

There are two major types of attention: additive and multiplicative. They are also called Bahdanau and Luong attention, after the first authors of the two papers that introduced them:

- Additive (Bahdanau): the previous decoder hidden state is used to score the encoder states when computing the next decoder state
- Multiplicative (Luong): the current decoder state is used to score each of the encoder states
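
The two scoring styles can be compared side by side. The sketch below uses the commonly cited forms, an additive score $$w_v^\top \tanh(W_q q + W_k k)$$ and a multiplicative score $$q^\top W k$$, where the query `q` is a decoder state and the key `k` is an encoder state. The parameter names `W_q`, `W_k`, `w_v`, and `W` are illustrative, and the identity matrices are toy values rather than learned weights.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def additive_score(query, key, W_q, W_k, w_v):
    """Additive (Bahdanau-style) score: w_v . tanh(W_q q + W_k k)."""
    projected = [a + b for a, b in zip(matvec(W_q, query), matvec(W_k, key))]
    return dot(w_v, [math.tanh(p) for p in projected])


def multiplicative_score(query, key, W):
    """Multiplicative (Luong-style) score: q . (W k)."""
    return dot(query, matvec(W, key))


identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights, not learned values
add = additive_score([1.0, 0.0], [0.0, 1.0], identity, identity, [1.0, 1.0])
mul = multiplicative_score([1.0, 2.0], [3.0, 4.0], identity)
print(add)
print(mul)  # 11.0
```

With the identity matrix as `W`, the multiplicative score reduces to a plain dot product between the decoder and encoder states.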
