Learn Before
Transformer
The Transformer is a deep learning architecture built exclusively on attention mechanisms, foregoing traditional recurrent or convolutional layers. A defining property of the Transformer is its superior scaling behavior: its performance consistently improves as the dataset size, model size, and computational budget increase. This architecture has become foundational, driving state-of-the-art results across natural language processing, computer vision, speech recognition, and reinforcement learning.
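Since the architecture is defined entirely by attention, a minimal sketch of scaled dot-product self-attention, the core operation inside every Transformer layer, may help make the idea concrete. The shapes, the single-head formulation, and the use of PyTorch below are illustrative assumptions, not details taken from this note.

```python
# Minimal sketch of scaled dot-product self-attention (single head).
# Shapes and the choice of PyTorch are assumptions made for illustration.
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q = x @ w_q                                  # queries
    k = x @ w_k                                  # keys
    v = x @ w_v                                  # values
    scores = q @ k.T / math.sqrt(k.shape[-1])    # every token scored against every token
    weights = torch.softmax(scores, dim=-1)      # attention weights, each row sums to 1
    return weights @ v                           # context as a weighted mix of values

# Toy usage: a "sentence" of 5 tokens with 16-dimensional embeddings.
seq_len, d_model, d_k = 5, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 8])
```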
Tags
Data Science
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
D2L
Dive into Deep Learning @ D2L
Related
Neural Machine Translation by Jointly Learning to Align and Translate
Effective Approaches to Attention-based Neural Machine Translation
Attention Motivation
Example of how Attention is used in Machine Translation
The Illustrated Transformer
Attention Is All You Need
Attention is all you need; Attentional Neural Network Models | Łukasz Kaiser | Masterclass
Tensor2Tensor Intro
Transformer model
Transformer
Efficient Transformers: A Survey
Evaluation of Efficient Transformers
Learn After
Self-attention layers' first approach
Transformers in contextual generation and summarization
Huggingface Model Summary
A Survey of Transformers (Lin et al., 2021)
Overview of a Transformer
Model Usage of Transformers
Attention in vanilla Transformers
Transformer Variants (X-formers)
The Pre-training and Fine-tuning Paradigm
Architectural Categories of Pre-trained Transformers
Computational Cost of Self-Attention in Transformers
Quadratic Complexity's Impact on Transformer Inference Speed
Pre-Norm Architecture in Transformers
Critique of the Transformer Architecture's Core Limitation
A research team is building a model to summarize extremely long scientific papers. They are comparing two distinct architectural approaches:
- Approach 1: Processes the input text sequentially, token by token, updating an internal state that is passed from one step to the next.
- Approach 2: Processes all input tokens simultaneously, using a mechanism that directly relates every token to every other token in the input to determine context.
Which of the following statements best analyzes the primary trade-off between these two approaches for this specific task?
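As a rough illustration of the two approaches described in the question (not an answer to it), the sketch below contrasts a sequential, state-passing update with an all-pairs attention computation; the tensor shapes and update rule are assumptions made for this example.

```python
# Illustrative contrast of the two approaches above (assumed shapes and weights).
import torch

seq_len, d = 6, 8
tokens = torch.randn(seq_len, d)

# Approach 1: sequential processing. Each step depends on the previous state,
# so the seq_len steps cannot run in parallel.
state = torch.zeros(d)
w = torch.randn(d, d)
for t in range(seq_len):
    state = torch.tanh(tokens[t] + state @ w)    # state carries context forward

# Approach 2: all tokens at once. An n x n score matrix relates every token
# to every other token; this parallelizes well but grows quadratically with n.
scores = tokens @ tokens.T / d ** 0.5
context = torch.softmax(scores, dim=-1) @ tokens
print(state.shape, context.shape)
```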
Architectural Design Choice for Machine Translation
Enablers of Universal Language Capabilities
Model Depth in Transformers
Generalization of the Language Modeling Concept
Transformer Block Sub-Layers
Standard Optimization Objective for Transformer Language Models