The `Seq2SeqAttentionDecoder` can be checked with a small test configuration: `vocab_size = 10`, `embed_size = 8`, `num_hiddens = 16`, and `num_layers = 2`, fed with a minibatch of 4 sequences of 7 time steps each. After running the encoder and initializing the decoder state, the forward pass produces an output tensor of shape `(batch_size, num_steps, vocab_size) = (4, 7, 10)`. The returned state holds the encoder outputs of shape `(batch_size, num_steps, num_hiddens) = (4, 7, 16)` and the decoder hidden state of shape `(num_layers, batch_size, num_hiddens) = (2, 4, 16)`, so each layer's slice has shape `(batch_size, num_hiddens) = (4, 16)`.
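The shape check can be reproduced with a minimal PyTorch sketch. The class bodies below are an illustrative reconstruction from the description in this section, not the original implementation: `AdditiveAttention`, its parameter names (`W_q`, `W_k`, `w_v`), and the embedding-plus-`nn.GRU` stand-in encoder are assumptions, and the valid lengths are passed as `None` to keep the test short.

```python
import torch
from torch import nn


class AdditiveAttention(nn.Module):
    """Minimal additive attention with optional length masking (assumed design)."""

    def __init__(self, num_hiddens):
        super().__init__()
        self.W_q = nn.Linear(num_hiddens, num_hiddens, bias=False)
        self.W_k = nn.Linear(num_hiddens, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)

    def forward(self, queries, keys, values, valid_lens=None):
        # queries: (batch, 1, hidden); keys and values: (batch, steps, hidden)
        features = torch.tanh(self.W_q(queries).unsqueeze(2) + self.W_k(keys).unsqueeze(1))
        scores = self.w_v(features).squeeze(-1)  # (batch, 1, steps)
        if valid_lens is not None:
            positions = torch.arange(keys.shape[1], device=keys.device)
            mask = positions[None, None, :] < valid_lens[:, None, None]
            scores = scores.masked_fill(~mask, -1e6)  # padded positions get ~0 weight
        self.attention_weights = torch.softmax(scores, dim=-1)
        return torch.bmm(self.attention_weights, values)  # (batch, 1, hidden)


class Seq2SeqAttentionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers):
        super().__init__()
        self.attention = AdditiveAttention(num_hiddens)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers)
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_all_outputs, enc_valid_lens):
        outputs, hidden_state = enc_all_outputs  # time-major encoder outputs
        return outputs.permute(1, 0, 2), hidden_state, enc_valid_lens

    def forward(self, X, state):
        enc_outputs, hidden_state, enc_valid_lens = state
        X = self.embedding(X).permute(1, 0, 2)  # (steps, batch, embed)
        outputs, self.attention_weights = [], []
        for x in X:
            query = hidden_state[-1].unsqueeze(1)  # (batch, 1, hidden)
            context = self.attention(query, enc_outputs, enc_outputs, enc_valid_lens)
            x = torch.cat((context, x.unsqueeze(1)), dim=-1)  # (batch, 1, embed + hidden)
            out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state)
            outputs.append(out)
            self.attention_weights.append(self.attention.attention_weights)
        outputs = self.dense(torch.cat(outputs, dim=0))  # (steps, batch, vocab)
        return outputs.permute(1, 0, 2), (enc_outputs, hidden_state, enc_valid_lens)


vocab_size, embed_size, num_hiddens, num_layers = 10, 8, 16, 2
batch_size, num_steps = 4, 7

# Stand-in encoder: an embedding followed by a multi-layer GRU.
enc_embedding = nn.Embedding(vocab_size, embed_size)
enc_rnn = nn.GRU(embed_size, num_hiddens, num_layers)
decoder = Seq2SeqAttentionDecoder(vocab_size, embed_size, num_hiddens, num_layers)

X = torch.zeros((batch_size, num_steps), dtype=torch.long)
enc_all_outputs = enc_rnn(enc_embedding(X).permute(1, 0, 2))
state = decoder.init_state(enc_all_outputs, None)
output, state = decoder(X, state)

print(output.shape)       # output predictions
print(state[0].shape)     # encoder outputs
print(state[1][0].shape)  # one layer of the decoder hidden state
```

The same all-zero minibatch is used as both encoder and decoder input here, which is enough for a shape check but says nothing about prediction quality.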

During the forward pass of the `Seq2SeqAttentionDecoder`, target token indices are first embedded and transposed to the time-major shape `(num_steps, batch_size, embed_size)`. The decoder then iterates over the time steps. At each step: (1) the final-layer hidden state from the previous time step is unsqueezed to shape `(batch_size, 1, num_hiddens)` to act as the attention query; (2) the additive attention mechanism computes a context vector of shape `(batch_size, 1, num_hiddens)` by attending over all encoder outputs, which serve as both keys and values, with the valid lengths used to mask padding; (3) the current embedded input is unsqueezed to shape `(batch_size, 1, embed_size)` and concatenated with the context vector along the feature dimension; (4) the concatenated tensor of shape `(batch_size, 1, embed_size + num_hiddens)` is transposed to time-major layout and fed into the GRU, which updates the decoder's hidden state. After all time steps have been processed, the GRU outputs are concatenated along the time dimension, projected through a dense layer, and transposed back to produce predictions of shape `(batch_size, num_steps, vocab_size)`. The attention weights from each step are stored for later visualization.
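A single decoding step can be traced with plain tensors to make the shapes in steps (1) to (4) concrete. This is a sketch under stated assumptions: PyTorch is assumed as the framework, the input tensors are random stand-ins, the projection names `W_q`, `W_k`, and `w_v` are hypothetical, and placing the context vector before the embedded input in the concatenation is one possible ordering.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, num_steps, embed_size, num_hiddens, num_layers = 4, 7, 8, 16, 2

# Random stand-ins for what the decoder would receive (illustrative values only).
enc_outputs = torch.randn(batch_size, num_steps, num_hiddens)    # keys and values
hidden_state = torch.randn(num_layers, batch_size, num_hiddens)  # decoder GRU state
enc_valid_lens = torch.tensor([7, 5, 3, 2])                      # unpadded source lengths
x_t = torch.randn(batch_size, embed_size)                        # embedded input at one step

# Hypothetical additive-attention projections and the decoder GRU.
W_q = nn.Linear(num_hiddens, num_hiddens, bias=False)
W_k = nn.Linear(num_hiddens, num_hiddens, bias=False)
w_v = nn.Linear(num_hiddens, 1, bias=False)
rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers)

# (1) Final-layer hidden state becomes the query: (4, 1, 16).
query = hidden_state[-1].unsqueeze(1)

# (2) Additive attention over the encoder outputs with padding masked out.
features = torch.tanh(W_q(query).unsqueeze(2) + W_k(enc_outputs).unsqueeze(1))  # (4, 1, 7, 16)
scores = w_v(features).squeeze(-1)                                              # (4, 1, 7)
positions = torch.arange(num_steps)
mask = positions[None, None, :] < enc_valid_lens[:, None, None]                 # (4, 1, 7)
weights = torch.softmax(scores.masked_fill(~mask, -1e6), dim=-1)
context = torch.bmm(weights, enc_outputs)                                       # (4, 1, 16)

# (3) Concatenate context and embedded input along the feature dimension: (4, 1, 24).
rnn_input = torch.cat((context, x_t.unsqueeze(1)), dim=-1)

# (4) Switch to time-major layout and advance the GRU by one step.
out, hidden_state = rnn(rnn_input.permute(1, 0, 2), hidden_state)

print(query.shape, context.shape, rnn_input.shape)
print(out.shape, hidden_state.shape)
```

Positions beyond each sequence's valid length receive a large negative score before the softmax, so their attention weights are effectively zero and padding does not contribute to the context vector.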

Seq2SeqAttentionDecoder Forward Pass

The decoder state in the `Seq2SeqAttentionDecoder` is initialized from the encoder outputs as a three-element tuple: (i) the encoder's last-layer hidden states at all time steps, transposed to shape `(batch_size, num_steps, num_hiddens)`, which serve as both the keys and values for the attention mechanism; (ii) the encoder's hidden states across all layers at the final time step, with shape `(num_layers, batch_size, num_hiddens)`, used to initialize the decoder's GRU hidden state; and (iii) the valid lengths of the encoder inputs, used to mask padding tokens during attention pooling.
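The three-element state can be sketched as follows, assuming PyTorch and a stand-in encoder built from `nn.Embedding` and `nn.GRU`. The standalone `init_state` function, the random source batch, and the example valid lengths are illustrative assumptions rather than the original code.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, embed_size, num_hiddens, num_layers = 10, 8, 16, 2
batch_size, num_steps = 4, 7

# Stand-in encoder: an embedding followed by a multi-layer GRU (time-major input).
embedding = nn.Embedding(vocab_size, embed_size)
encoder_rnn = nn.GRU(embed_size, num_hiddens, num_layers)

src = torch.randint(0, vocab_size, (batch_size, num_steps))
enc_valid_lens = torch.tensor([7, 5, 3, 2])  # illustrative unpadded lengths

# outputs: last-layer hidden states at every step, (num_steps, batch_size, num_hiddens)
# hidden_state: all layers at the final step, (num_layers, batch_size, num_hiddens)
outputs, hidden_state = encoder_rnn(embedding(src).permute(1, 0, 2))


def init_state(enc_all_outputs, enc_valid_lens):
    """Build the (keys/values, initial GRU state, valid lengths) tuple."""
    outputs, hidden_state = enc_all_outputs
    # Move the batch dimension first so attention can treat steps as key positions.
    return outputs.permute(1, 0, 2), hidden_state, enc_valid_lens


enc_outputs, dec_hidden, valid_lens = init_state((outputs, hidden_state), enc_valid_lens)
print(enc_outputs.shape)  # (i) keys and values
print(dec_hidden.shape)   # (ii) initial decoder hidden state

# For a unidirectional GRU, the last-layer output at the final step matches
# the last layer of the final hidden state, linking elements (i) and (ii).
same_final_step = torch.allclose(enc_outputs[:, -1], dec_hidden[-1])
print(same_final_step)
```

Keeping the valid lengths in the state lets every decoding step reuse them for masking without threading an extra argument through the decoder's forward call.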
