Learn Before
Concept

Self-Attention layer understanding - Step 1 - Getting rid of RNN

Note that this is not how the actual self-attention layer in the Transformer works; it is just a modification of the seq2seq encoder. So as the first step, let's simply get rid of the RNNs used in seq2seq. For each word embedding, we score all the other embeddings using the dot-product attention score we already saw before. Each of those scores is divided by the square root of the dimension of the input vectors to the score function; this trick helps keep the gradients stable. We then take a softmax of those scores and compute the weighted sum of the other embeddings using the softmax weights. Adding this vector to the embedding of the current word gives the output of the self-attention layer for the current timestep.
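A minimal NumPy sketch of this modified encoder step, assuming the interpretation above (the function name, shapes, and the choice to mask out each word's score against itself are illustrative assumptions, not from the original):

```python
import numpy as np

def simplified_self_attention(embeddings):
    """One step of the simplified, RNN-free encoder described above.

    embeddings: array of shape (seq_len, d), one row per word.
    Returns an array of the same shape.
    """
    seq_len, d = embeddings.shape

    # Dot-product score of each word against the others,
    # divided by sqrt(d) to keep the gradients stable.
    scores = embeddings @ embeddings.T / np.sqrt(d)   # (seq_len, seq_len)

    # Assumption: mask out each word's own score so only the
    # *other* embeddings contribute to the weighted sum.
    np.fill_diagonal(scores, -np.inf)

    # Row-wise softmax over the scores.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted sum of the other embeddings, plus the current word's
    # own embedding, gives the output for that position.
    context = weights @ embeddings                    # (seq_len, d)
    return embeddings + context

# Example: a 3-word "sentence" with 4-dimensional embeddings.
x = np.random.randn(3, 4)
print(simplified_self_attention(x).shape)  # (3, 4)
```

Note that every position can be computed in parallel here, which is exactly what dropping the RNN buys us.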

Updated 2020-10-24

Tags

Data Science