In machine learning, visualizing and inspecting the training data is highly recommended. Because human perception is naturally adept at identifying visual oddities and patterns, data visualization serves as a crucial safeguard against errors and mistakes during the design of experiments.

Importance of Training Data Visualization

1. Training data included the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding.
2. For English-French, the significantly larger WMT 2014 English-French dataset was used, consisting of 36M sentences. Sentence pairs were batched together by approximate sequence length.
3. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
4. Optimizer: Adam with $$\beta_1 = 0.9$$, $$\beta_2 = 0.98$$, and $$\epsilon = 10^{-9}$$.
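
To make the optimizer settings in the list above concrete, here is a minimal sketch of a single Adam update in plain Python, using the stated values for the two decay rates and epsilon as defaults. The function name `adam_step`, the flat-list parameter layout, and the learning rate `lr` are assumptions made for illustration; the notes above do not state a learning rate.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.98, eps=1e-9):
    """Apply one Adam update to flat lists of floats.

    beta1, beta2 and eps default to the values listed above.
    lr is a hypothetical value chosen only for this sketch.
    t is the 1-based step count used for bias correction.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
params, m, v = [1.0], [0.0], [0.0]
for t in range(1, 6):
    grads = [2.0 * p for p in params]
    params, m, v = adam_step(params, grads, m, v, t, lr=0.05)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's scale.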

New York University

This paper describes the Transformer model. A Transformer maps an input sequence to an output sequence, with the mapping depending on the task.
Examples include translating text from one language to another or producing an answer to a question, using an encoder and a decoder stacked together.

Attention Is All You Need (Presentation)

1. RNNs, LSTMs, and gated RNNs are the most widely used approaches for sequence modeling tasks such as machine translation and language modeling.
2. RNNs process a sequence token by token, and this sequential dependence is an obstacle to parallelizing the computation. Moreover, when sequences are long, the model is prone to forgetting the content of distant positions or mixing it with the content of later positions.
3. Attention mechanisms are one solution to this forgetting problem, because they allow dependencies to be modeled regardless of their distance in the input or output sequences.


What problem is this paper trying to solve?

1. The Transformer model proposed in this paper is an architecture that relies entirely on the attention mechanism to draw global dependencies between input and output.
2. It allows significantly more parallelization and can reach state-of-the-art translation quality after training for as little as 12 hours on 8 P100 GPUs.


Transformer model

Training Data


The Transformer offers several computational and performance advantages over recurrent architectures:

- **Constant sequential operations**: A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent neural network (RNN) requires $$O(n)$$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers whenever the sequence length is smaller than the representation dimensionality.
- **Faster training**: For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
- **State-of-the-art translation results**: On both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, the Transformer achieved a new state-of-the-art. On the English-to-German task, the best model outperformed even all previously reported ensembles.
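
The complexity claim in the first bullet can be checked with a rough operation count. The sketch below assumes the usual per-layer estimates of $$O(n^2 \cdot d)$$ for self-attention and $$O(n \cdot d^2)$$ for a recurrent layer, where $$n$$ is the sequence length and $$d$$ the representation dimensionality. The function names and the example value `d = 512` are chosen for illustration only.

```python
def self_attention_cost(n, d):
    """Rough per-layer operation count for self-attention: n^2 * d."""
    return n * n * d

def recurrent_cost(n, d):
    """Rough per-layer operation count for a recurrent layer: n * d^2."""
    return n * d * d

def sequential_steps(n, layer):
    """Operations that must run one after another: 1 for self-attention, n for an RNN."""
    return 1 if layer == "self_attention" else n

# Compare the two layer types for a short, a break-even, and a long sequence.
d = 512  # hypothetical representation size for this sketch
for n in (50, 512, 2000):
    ratio = self_attention_cost(n, d) / recurrent_cost(n, d)
    print(f"n={n:5d}  attention/recurrent cost ratio = {ratio:.3f}  "
          f"sequential steps: {sequential_steps(n, 'self_attention')} vs "
          f"{sequential_steps(n, 'recurrent')}")
```

The ratio simplifies to $$n / d$$, so self-attention is cheaper per layer exactly when the sequence is shorter than the representation is wide, while its number of sequential steps stays constant in either case.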

Advantages and Performance of the Transformer Model

The Transformer model employs two primary regularization techniques during training:

1. **Residual Dropout**: Applied to the output of each sub-layer before it is added to the sub-layer input and normalized. Dropout is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, a dropout rate of $$P_{drop} = 0.1$$ is used.

2. **Label Smoothing**: Employed during training. While this hurts perplexity, as the model learns to be more unsure, it improves accuracy and BLEU (Bilingual Evaluation Understudy) scores.
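
The effect of label smoothing described above can be illustrated with a small sketch. It uses one common variant, in which the true class receives probability `1 - epsilon` and the remainder is spread uniformly over the other classes. The value `epsilon = 0.1`, the three-class toy vocabulary, and the function names are assumptions for illustration; the notes above do not specify the smoothing value.

```python
import math

def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Return a smoothed target distribution.

    The true class gets 1 - epsilon; the other classes share epsilon equally.
    epsilon=0.1 is an assumed value for this sketch.
    """
    off_value = epsilon / (num_classes - 1)
    dist = [off_value] * num_classes
    dist[target_index] = 1.0 - epsilon
    return dist

def cross_entropy(target_dist, predicted_probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p)
                for t, p in zip(target_dist, predicted_probs) if t > 0.0)

# Toy example with three classes, where class 0 is correct.
target = smooth_labels(0, 3)
very_confident = [0.98, 0.01, 0.01]
less_confident = [0.90, 0.05, 0.05]

loss_confident = cross_entropy(target, very_confident)
loss_hedged = cross_entropy(target, less_confident)
```

Against the smoothed target, the less confident prediction receives the lower loss, which is the sense in which the model "learns to be more unsure": it is penalized for putting nearly all probability mass on a single token.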

Regularization Techniques in the Transformer Model

The Transformer model introduced in 'Attention is All You Need' relies on two attention functions: 1) Scaled Dot-Product Attention, which computes attention over a set of queries simultaneously by packing them into a matrix, scaling the query-key dot products, and applying softmax to produce a weighted sum of the values; and 2) Multi-Head Attention, which runs several scaled dot-product attention functions in parallel so the model can jointly attend to information from different representation subspaces at different positions.
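
As a rough illustration of the first function, the sketch below computes scaled dot-product attention, $$\mathrm{softmax}(QK^T / \sqrt{d_k})\,V$$, on plain Python lists. The function names and the tiny example matrices are made up for this sketch, and it omits masking and the learned projections that multi-head attention adds.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    queries: list of vectors of length d_k
    keys:    list of vectors of length d_k
    values:  list of vectors, one per key
    """
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Scaled dot product of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(value_dim)])
    return outputs

# Usage: two keys/values; the second query points strongly at the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.0, 0.0], [10.0, 0.0]]
attended = scaled_dot_product_attention(queries, keys, values)
```

A zero query scores every key equally and returns the average of the values, while a query aligned with one key returns almost exactly that key's value. Multi-head attention would run this same function several times in parallel on differently projected queries, keys, and values, then concatenate the results.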
