Attention Functions Used

  1. Scaled Dot-Product Attention: computes the attention function on a set of queries simultaneously, packed together into a matrix Q; the keys and values are likewise packed into matrices K and V.
  2. Multi-Head Attention: allows the model to jointly attend to information from different representation subspaces at different positions (see the sketch after this list).
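
Both functions come from "Attention Is All You Need" (Vaswani et al., 2017), where scaled dot-product attention is defined as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, and multi-head attention runs several such attentions over learned linear projections of Q, K, and V before concatenating the results. Below is a minimal NumPy sketch of both; the function names, the random placeholder weights, and the seed are illustrative assumptions (a trained model learns the projections), while d_model = 512 and 8 heads match the paper's base configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, num_heads, rng):
    # Project Q, K, V into num_heads lower-dimensional subspaces,
    # attend in each subspace in parallel, concatenate the results,
    # and apply a final output projection. The weights here are random
    # placeholders; in a real model they are learned parameters.
    d_model = Q.shape[-1]
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
    Wo = rng.standard_normal((d_model, d_model))
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 512))  # 10 positions, d_model = 512
out = multi_head_attention(x, x, x, num_heads=8, rng=rng)  # self-attention
print(out.shape)  # (10, 512)
```

Passing the same tensor as Q, K, and V, as in the usage line above, gives self-attention; cross-attention simply supplies different sources for the queries and the key/value pairs.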

Updated 2021-08-19

Tags

Data Science