1Cademy - Attention Decoder

Learn Before

Attention Motivation

Concept

Attention Decoder

In an attention-based decoder, the RNN cell itself is unchanged, but the encoder's hidden states inform word generation at every decoder time step. The word produced at each step is a function of all encoder hidden states together with the current decoder state. Because input sequences vary in length, the number of encoder hidden states differs across examples, so the output function cannot take them directly as a fixed-size input. Attention resolves this by weighting each encoder hidden state by its relevance to the current decoder state and summing the weighted states, which always yields a single vector of fixed dimension. The weights come from a score function that measures the compatibility between each encoder hidden state and the decoder state. For an input with three encoder states:

Encoder states: $h_{1}, h_{2}, h_{3}$

Decoder current state: $h'_{t}$

Scores: $\text{score}(h_{1}, h'_{t}),\; \text{score}(h_{2}, h'_{t}),\; \text{score}(h_{3}, h'_{t})$

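Applying softmax yields normalized weights $s_{1}, s_{2}, s_{3}$, where $s_{i} = \frac{\exp(\text{score}(h_{i}, h'_{t}))}{\sum_{j} \exp(\text{score}(h_{j}, h'_{t}))}$, so the weights are positive and sum to 1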

The context vector is computed as: $c_{t} = s_{1} \cdot h_{1} + s_{2} \cdot h_{2} + s_{3} \cdot h_{3}$
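The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the dot product is assumed as the score function (the text does not fix one), and the function names and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    """Dot product, used here as an assumed score function."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(encoder_states, decoder_state, score=dot):
    """Score each encoder state, normalize, and return (weights, context)."""
    scores = [score(h, decoder_state) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Weighted sum of encoder states: c_t = sum_i s_i * h_i
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Toy values (hypothetical): three encoder states h_1, h_2, h_3 and h'_t
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [1.0, 0.0]

weights, context = attention_context(encoder_states, decoder_state)
print(weights)
print(context)
```

Whatever the number of encoder states passed in, the returned context vector has the same dimension as a single encoder state, which is what lets the output function keep a fixed input size.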

Each encoder state is scored against the decoder state, the scores are normalized via softmax, and the resulting weighted sum forms a context vector. This context vector, combined with the decoder hidden state, determines the current output word. In the Bahdanau attention variant specifically, the query is the decoder hidden state at the previous time step ($h'_{t-1}$ rather than $h'_{t}$), and the encoder hidden states at all time steps serve as both the keys and the values.
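As a rough sketch of the Bahdanau variant, the score is commonly written in additive form, $\text{score}(h_{i}, h'_{t-1}) = v^{\top} \tanh(W_{k} h_{i} + W_{q} h'_{t-1})$. The example below assumes that form. The parameter names `W_k`, `W_q`, and `v` and all numeric values are hypothetical stand-ins for parameters that would be learned during training.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_score(h_enc, s_prev, W_k, W_q, v):
    """Additive score: v^T tanh(W_k h_enc + W_q s_prev)."""
    keys = matvec(W_k, h_enc)      # projected encoder state (key)
    query = matvec(W_q, s_prev)    # projected previous decoder state (query)
    hidden = [math.tanh(k + q) for k, q in zip(keys, query)]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical fixed parameters; in a real model these are learned.
W_k = [[0.5, -0.2], [0.1, 0.3]]
W_q = [[0.4, 0.0], [-0.3, 0.2]]
v = [1.0, -1.0]

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # keys and values
s_prev = [0.5, -0.5]                                   # query: h'_{t-1}

scores = [additive_score(h, s_prev, W_k, W_q, v) for h in encoder_states]
weights = softmax(scores)
# The encoder states also act as the values in the weighted sum.
context = [
    sum(w * h[i] for w, h in zip(weights, encoder_states))
    for i in range(len(encoder_states[0]))
]
print(weights)
print(context)
```

Only the score function differs from the generic recipe: the softmax normalization and the weighted sum over encoder states are unchanged.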

Updated 2026-05-14

Contributors are:

RA

Who are from:

University of California, Berkeley
