Concept

Cross-Attention Layer

In the cross-attention layer of the transformer implementation for the encoder-decoder architecture, the final output of the encoder $H^{enc}$ is multiplied by the cross-attention layer's key weights $W^K$ and value weights $W^V$, while the output from the prior decoder layer $H^{dec[i-1]}$ is multiplied by the cross-attention layer's query weights $W^Q$:

$$Q = W^Q H^{dec[i-1]}; \quad K = W^K H^{enc}; \quad V = W^V H^{enc}$$

$$\implies \text{CrossAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors.
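A minimal NumPy sketch of this computation is given below, assuming single-head attention with no masking; all names and shapes are illustrative, not from the card. Note it uses the row-vector convention $Q = H W^Q$, the transpose of the $Q = W^Q H$ form above, so that each row of the input holds one token's representation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(H_dec_prev, H_enc, W_Q, W_K, W_V):
    """Single-head cross-attention.

    H_dec_prev: (T_dec, d_model) output of the previous decoder layer
    H_enc:      (T_enc, d_model) final encoder output
    W_Q, W_K:   (d_model, d_k)   query and key projection weights
    W_V:        (d_model, d_v)   value projection weights
    """
    # Queries come from the decoder; keys and values from the encoder.
    Q = H_dec_prev @ W_Q               # (T_dec, d_k)
    K = H_enc @ W_K                    # (T_enc, d_k)
    V = H_enc @ W_V                    # (T_enc, d_v)

    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (T_dec, T_enc)
    weights = softmax(scores, axis=-1) # attend over encoder positions
    return weights @ V                 # (T_dec, d_v)

# Usage with random inputs (shapes chosen for illustration).
rng = np.random.default_rng(0)
T_dec, T_enc, d_model, d_k = 4, 6, 8, 8
H_dec_prev = rng.normal(size=(T_dec, d_model))
H_enc = rng.normal(size=(T_enc, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(cross_attention(H_dec_prev, H_enc, W_Q, W_K, W_V).shape)  # (4, 8)
```

Because the keys and values come from the encoder, the softmax weights sum to one over encoder positions: each decoder position produces a weighted mixture of encoder representations.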


Updated 2021-12-05

Tags

Data Science
