Concept

Cross-Attention Layer

In the cross-attention layer of the transformer implementation for the encoder-decoder architecture, the final output of the encoder $H^{enc}$ is multiplied by the cross-attention layer's key weights $W^K$ and value weights $W^V$, while the output from the prior decoder layer $H^{dec[i-1]}$ is multiplied by the cross-attention layer's query weights $W^Q$:

$$Q = W^Q H^{dec[i-1]}; \quad K = W^K H^{enc}; \quad V = W^V H^{enc}$$

$$\implies \text{CrossAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors.
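A minimal NumPy sketch of this computation is given below, assuming single-head attention with no masking; all names and shapes are illustrative, not from the card. Note it uses the row-vector convention $Q = H W^Q$, the transpose of the $Q = W^Q H$ form above, so that each row of the input holds one token's representation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(H_dec_prev, H_enc, W_Q, W_K, W_V):
    """Single-head cross-attention.

    H_dec_prev: (T_dec, d_model) output of the previous decoder layer
    H_enc:      (T_enc, d_model) final encoder output
    W_Q, W_K:   (d_model, d_k)   query and key projection weights
    W_V:        (d_model, d_v)   value projection weights
    """
    # Queries come from the decoder; keys and values from the encoder.
    Q = H_dec_prev @ W_Q               # (T_dec, d_k)
    K = H_enc @ W_K                    # (T_enc, d_k)
    V = H_enc @ W_V                    # (T_enc, d_v)

    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (T_dec, T_enc)
    weights = softmax(scores, axis=-1) # attend over encoder positions
    return weights @ V                 # (T_dec, d_v)

# Usage with random inputs (shapes chosen for illustration).
rng = np.random.default_rng(0)
T_dec, T_enc, d_model, d_k = 4, 6, 8, 8
H_dec_prev = rng.normal(size=(T_dec, d_model))
H_enc = rng.normal(size=(T_enc, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(cross_attention(H_dec_prev, H_enc, W_Q, W_K, W_V).shape)  # (4, 8)
```

Because the keys and values come from the encoder, the softmax weights sum to one over encoder positions: each decoder position produces a weighted mixture of encoder representations.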


Updated 2021-12-05

Tags

Data Science
