One of the main papers that introduces the attention mechanism. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. https://arxiv.org/abs/1409.0473

Neural Machine Translation by Jointly Learning to Align and Translate

Another main paper on the attention mechanism:
Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Effective Approaches to Attention-based Neural Machine Translation

Attention is one of the most important innovations in deep learning of the last few years. The papers that introduced the mechanism use machine translation as their example, so let’s quickly review the encoder-decoder architecture. It has two parts, an encoder and a decoder. The encoder runs an RNN over the input and returns a single final context vector, which is later fed to a second RNN, the decoder, as its initial hidden state. One big problem with this design is that performance drops considerably as sentences get longer, even though LSTMs are supposed to retain long-term information. To cope with long sentences, researchers came up with a technique called attention.
The attention mechanism tries to mimic how we think: we focus on different elements of a sentence or an image in turn before describing what is there. Instead of passing only one vector to the decoder, we pass the hidden state vectors from every time step.
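
To make the difference concrete, here is a minimal NumPy sketch, not taken from either paper. It assumes a plain tanh RNN encoder (the papers use gated units), and the names `encode`, `W_x`, `W_h` and the sizes are hypothetical. It shows that the plain encoder-decoder hands over one vector of fixed size, while attention keeps one hidden state per input time step.

```python
import numpy as np

def encode(inputs, W_x, W_h):
    """Run a plain tanh RNN over `inputs` (T, D); return every hidden state (T, H)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:                        # one time step at a time
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
D, H = 4, 3                                 # hypothetical embedding and hidden sizes
W_x = rng.normal(size=(H, D))
W_h = rng.normal(size=(H, H))

short = encode(rng.normal(size=(3, D)), W_x, W_h)     # 3-token sentence
long = encode(rng.normal(size=(12, D)), W_x, W_h)     # 12-token sentence

# Plain encoder-decoder: only the last hidden state reaches the decoder,
# so a 12-token sentence is squeezed into the same H numbers as a 3-token one.
fixed_context_short = short[-1]
fixed_context_long = long[-1]

# Attention: the decoder receives all hidden states, one per input time step.
print(fixed_context_short.shape, fixed_context_long.shape)
print(short.shape, long.shape)
```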


Attention Motivation

Here is how the model translates the English sentence "How are you?" into Russian. In this case it is multiplicative attention: the current decoder hidden state is multiplied with each encoder hidden state to score how relevant each input word is to the word being generated.
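
The scoring step can be sketched as follows. This is an illustrative sketch in the style of the "general" multiplicative score from Luong et al., using random vectors in place of a trained model. The function names, the hidden size `H`, and the choice of an identity matrix for `W` (which reduces the score to a plain dot product) are assumptions made for the example.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    shifted = scores - scores.max()         # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def multiplicative_attention(decoder_state, encoder_states, W):
    """Score each encoder state as decoder_state^T W encoder_state."""
    scores = encoder_states @ W.T @ decoder_state     # shape (T,)
    weights = softmax(scores)                         # one weight per source token
    context = weights @ encoder_states                # weighted sum, shape (H,)
    return weights, context

source_tokens = ["How", "are", "you", "?"]
rng = np.random.default_rng(0)
H = 6                                                 # hypothetical hidden size
encoder_states = rng.normal(size=(len(source_tokens), H))
decoder_state = rng.normal(size=H)                    # current decoder hidden state
W = np.eye(H)                                         # assumption: identity, i.e. dot-product score

weights, context = multiplicative_attention(decoder_state, encoder_states, W)
for token, weight in zip(source_tokens, weights):
    print(f"{token:>4}: {weight:.3f}")
```

In a trained model `W` is learned, and the context vector is combined with the decoder state to predict the next target word.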

Example of how Attention is used in Machine Translation

A very good, in-depth article on the Transformer model:

http://jalammar.github.io/illustrated-transformer/

The Illustrated Transformer

A very influential paper that introduced the Transformer model:
https://arxiv.org/abs/1706.03762

Attention Is All You Need

A very good video explaining the transformer model:
https://www.youtube.com/watch?v=rBCqOTEfxvg

Attention is all you need; Attentional Neural Network Models | Łukasz Kaiser | Masterclass

A Colab notebook showing how to use the Transformer model with the Tensor2Tensor library, which is built on TensorFlow:
https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb


Tensor2Tensor Intro

The concept of attention dramatically improved seq2seq models. A lot of work has built on that concept since the two main attention papers were released. One example is the paper “Attention Is All You Need.” It introduced a model called the “Transformer” that uses only attention mechanisms to work with sequential data, with no RNNs. The model is highly parallelizable and runs very well on GPUs. One of the main problems with RNNs is that they are hard to parallelize, because before we can evaluate one time step in the encoder we need the result of the previous one. You will see how easily the Transformer model can be parallelized. As before, let’s consider the example of machine translation. As before, the model consists of two parts:
- Encoders
- Decoders
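
The parallelization point above can be illustrated with a small NumPy sketch of scaled dot-product self-attention. It is a simplification: a single head with no masking, positional encoding, or feed-forward layers, and random matrices standing in for learned weights. The names `self_attention`, `W_q`, `W_k`, `W_v` and the sizes are hypothetical. The point is that every position is computed in one matrix product, and any single position can be computed without waiting for the others.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for all positions at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T): every position vs. every position
    return softmax_rows(scores) @ V           # (T, D)

rng = np.random.default_rng(0)
T, D = 5, 8                                   # hypothetical sequence length and model size
X = rng.normal(size=(T, D))                   # one embedding per input token
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

all_at_once = self_attention(X, W_q, W_k, W_v)

# Position 2 computed entirely on its own: it needs the inputs X,
# but not the output of position 1, unlike an RNN time step.
K, V = X @ W_k, X @ W_v
q2 = X[2] @ W_q
row2 = softmax_rows((q2 @ K.T / np.sqrt(D))[None, :]) @ V

print(np.allclose(row2[0], all_at_once[2]))
```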


Transformer model

The Transformer is a deep learning architecture built exclusively on attention mechanisms, forgoing traditional recurrent or convolutional layers. A defining property of the Transformer is its superior scaling behavior: its performance consistently improves as the dataset size, model size, and computational budget increase. This architecture has become foundational, driving state-of-the-art results across natural language processing, computer vision, speech recognition, and reinforcement learning.

Transformer

Efficient Transformers: A Survey


Transformers are known for their self-attention mechanism and their ability to process sequential data in parallel, but there is growing concern over the quadratic time and memory complexity of self-attention with respect to sequence length.
Efficient transformers address this issue by reducing memory and computational costs compared with earlier Transformer models.
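
A rough back-of-the-envelope sketch of where the quadratic cost comes from: full self-attention stores one score for every pair of positions. The helper names are hypothetical, and the figures assume 4-byte (float32) values for a single attention head in a single layer, ignoring everything else the model stores.

```python
def attention_score_entries(seq_len):
    """Number of pairwise scores in one full self-attention matrix."""
    return seq_len * seq_len

def attention_score_bytes(seq_len, bytes_per_value=4):
    """Memory for one score matrix, assuming float32 values by default."""
    return attention_score_entries(seq_len) * bytes_per_value

for n in (512, 1024, 2048):
    print(n, attention_score_entries(n), attention_score_bytes(n))
```

Doubling the sequence length quadruples the number of scores, which is the growth that efficient Transformer variants try to avoid.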

Evaluation of Efficient Transformers

The current decoder hidden state is computed from the previous hidden state, the previously generated word, and the current context vector. This context vector is derived from the attention computation, which compares the previous hidden state with all of the encoder hidden states.
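
One decoder step of this kind can be sketched as below. This is a sketch under stated assumptions rather than the exact model from any of the papers: the comparison uses an additive (Bahdanau-style) score, the state update is a plain tanh layer instead of a gated unit, and `decoder_step`, the parameter names, and all sizes are hypothetical.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def decoder_step(prev_state, prev_word_emb, encoder_states, params):
    """Return the new decoder state, the context vector, and the attention weights."""
    # 1. Compare the previous decoder state with every encoder state (additive score).
    hidden = np.tanh(encoder_states @ params["W_enc"].T + params["W_dec"] @ prev_state)
    weights = softmax(hidden @ params["v"])            # shape (T,)
    # 2. The context vector is the weighted sum of the encoder states.
    context = weights @ encoder_states                 # shape (H_enc,)
    # 3. New state from previous word, previous state, and context.
    combined = np.concatenate([prev_word_emb, prev_state, context])
    new_state = np.tanh(params["W_out"] @ combined)    # shape (H_dec,)
    return new_state, context, weights

rng = np.random.default_rng(0)
T, E, H_enc, H_dec, A = 4, 5, 6, 7, 8                  # hypothetical sizes
params = {
    "W_enc": rng.normal(size=(A, H_enc)),
    "W_dec": rng.normal(size=(A, H_dec)),
    "v": rng.normal(size=A),
    "W_out": rng.normal(size=(H_dec, E + H_dec + H_enc)),
}
encoder_states = rng.normal(size=(T, H_enc))
prev_state = np.zeros(H_dec)
prev_word_emb = rng.normal(size=E)                     # embedding of the previous output word

new_state, context, weights = decoder_step(prev_state, prev_word_emb, encoder_states, params)
print(new_state.shape, context.shape, weights.round(3))
```

Because the weights are recomputed from the decoder state at every step, the context vector changes as decoding proceeds while keeping a fixed size.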

University of Michigan - Ann Arbor

Claude

The number of hidden states generated from the encoding process varies with the size of the input, making it difficult to use them directly as a context for the decoder.
- Solution 1: basic RNN-based architecture
    - Advantage: simple; reduces the context to a fixed-length vector.
    - Drawback: the final hidden state is more focused on the latter parts of the input sequence.
- Solution 2: Bi-RNNs
    - Advantage: focuses on the input as a whole, rather than only the latter parts.
    - Drawback: loses information about each of the individual encoder states that might be useful in decoding.
- Solution 3: attention mechanism
    - Advantages: considers the whole encoder context; dynamically updates during decoding; can be embodied in a fixed-size vector.

Context vector

A helpful book on NLP that is still being written:
https://web.stanford.edu/~jurafsky/slp3/
