1Cademy - Attention in vanilla Transformers

Relation

Attention in vanilla Transformers

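Multi-head self-attention: the queries, keys, and values are projected into several subspaces (heads), attention is computed in each head in parallel, and the head outputs are concatenated and projected back into a single $D_m$-dimensional representation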
Masked attention: the self-attention modules in the decoder are adapted to prevent each position from attending to subsequent positions, which preserves the autoregressive property
Cross-attention: in the decoder, the queries are projected from the outputs of the previous (decoder) layer, whereas the keys and values are projected from the outputs of the encoder
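
The three variants above can be illustrated with a minimal pure-Python sketch. It is not a reference implementation: the learned projection matrices are replaced by simple slicing of each vector into heads (an assumption made only to keep the example short), and the function names `attention`, `split_heads`, and `multi_head_attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors.

    With causal=True, query position i only sees key positions j <= i
    (masked attention). Returns (outputs, attention_weights).
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        n_visible = min(i + 1, len(keys)) if causal else len(keys)
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(n_visible)
        ]
        # Masked (future) positions receive zero weight.
        weights = softmax(scores) + [0.0] * (len(keys) - n_visible)
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

def split_heads(vectors, num_heads):
    """Assumption: slice each D_m vector into equal parts instead of
    applying learned per-head projections."""
    d_head = len(vectors[0]) // num_heads
    return [
        [v[h * d_head:(h + 1) * d_head] for v in vectors]
        for h in range(num_heads)
    ]

def multi_head_attention(query_src, memory, num_heads, causal=False):
    """query_src supplies the queries; memory supplies keys and values.

    Self-attention:  query_src and memory are the same sequence.
    Cross-attention: query_src is decoder output, memory is encoder output.
    """
    q_heads = split_heads(query_src, num_heads)
    kv_heads = split_heads(memory, num_heads)
    head_outputs = [
        attention(q, kv, kv, causal)[0] for q, kv in zip(q_heads, kv_heads)
    ]
    # Concatenate the head outputs at each position back into D_m values.
    return [
        sum((head[i] for head in head_outputs), [])
        for i in range(len(query_src))
    ]

# Toy inputs with D_m = 4 (values are arbitrary illustration data).
encoder_out = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
decoder_in = [[0.5, 0.5, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]

# Masked multi-head self-attention in the decoder.
self_attn = multi_head_attention(decoder_in, decoder_in, num_heads=2, causal=True)

# Cross-attention: decoder queries, encoder keys and values.
cross_attn = multi_head_attention(decoder_in, encoder_out, num_heads=2)

# Position 0 can only attend to itself, so its weights are [1.0, 0.0].
_, masked_weights = attention(decoder_in, decoder_in, decoder_in, causal=True)
```

In a trained Transformer, each head would apply its own learned query, key, and value projections, and a final output projection would follow the concatenation; the slicing here only mimics the shape bookkeeping.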