
Attention in vanilla Transformers

  • Multi-head self-attention: several attention heads are computed in parallel and their outputs are concatenated into a single $D_m$-dimensional representation (see the sketch after this list)

  • Masked attention: the self-attention modules in the decoder are adapted to prevent each position from attending to subsequent positions

  • Cross-attention: in the decoder, the queries are projected from the outputs of the previous (decoder) layer, whereas the keys and values are projected using the outputs of the encoder
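
These three variants differ mainly in where the queries, keys, and values come from and in whether a mask is applied before the softmax. Below is a minimal PyTorch sketch, assuming a model dimension d_model (the $D_m$ above), n_heads attention heads, and inputs of shape (batch, seq_len, d_model); the class and argument names are illustrative, not taken from this card.

```python
# Minimal multi-head attention sketch covering self-, masked, and cross-attention.
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        # Separate projections for queries, keys, and values.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # output projection after concatenation

    def forward(self, q_in, kv_in, mask=None):
        # Self-attention: q_in and kv_in are the same tensor.
        # Cross-attention: q_in comes from the previous decoder layer,
        # kv_in from the encoder output.
        b, q_len, _ = q_in.shape
        kv_len = kv_in.shape[1]
        # Project, then split into heads: (batch, heads, seq, d_head).
        q = self.w_q(q_in).view(b, q_len, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(kv_in).view(b, kv_len, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v(kv_in).view(b, kv_len, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            # Masked attention: forbidden positions get -inf before the softmax.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = scores.softmax(dim=-1)
        out = attn @ v
        # Concatenate the heads back into a single d_model-dimensional representation.
        out = out.transpose(1, 2).contiguous().view(b, q_len, -1)
        return self.w_o(out)


# Usage: decoder-style masked self-attention and encoder-decoder cross-attention.
d_model, n_heads, seq = 512, 8, 10
mha = MultiHeadAttention(d_model, n_heads)
x = torch.randn(2, seq, d_model)           # decoder-side input
enc_out = torch.randn(2, 12, d_model)      # encoder output
causal = torch.tril(torch.ones(seq, seq))  # lower-triangular (causal) mask
self_attn = mha(x, x, mask=causal)         # masked self-attention
cross_attn = mha(x, enc_out)               # queries from decoder, keys/values from encoder
```

Note that the same module serves both roles: only the choice of q_in versus kv_in (and the presence of the causal mask) distinguishes decoder self-attention from cross-attention.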


Updated 2022-05-19


Tags

Data Science
