When applying a patch embedding operation to an input image with a height and width of $$\text{img\_size}$$, using a specific $$\text{patch\_size}$$, the resulting sequence will contain $$(\text{img\_size} // \text{patch\_size})^2$$ patches. Each of these patches is then linearly projected into a vector of a fixed length, commonly denoted as $$\text{num\_hiddens}$$.
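
As a quick arithmetic check of the patch count, the following sketch evaluates the formula for a square image. The helper name `num_patches` and the batch size of 4 are illustrative assumptions, not part of any library.

```python
def num_patches(img_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = img_size // patch_size  # floor division drops any partial patch
    return per_side ** 2


# Example values: a 96x96 image split into 16x16 patches
print(num_patches(96, 16))  # 36

# Shape of the embedded sequence for an assumed batch size of 4
batch_size, num_hiddens = 4, 512
print((batch_size, num_patches(96, 16), num_hiddens))  # (4, 36, 512)
```

Note that when the image size is not a multiple of the patch size, the floor division silently ignores the leftover border pixels.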

Output Shape of Patch Embedding in Vision Transformers

In deep learning frameworks, patch embedding can be implemented as a neural network module. The core mechanism is a 2D convolution layer where both the kernel size and stride are set to the desired patch size. The output of the convolution is then flattened spatially and transposed to produce a sequence of patch representations.

```python
# PyTorch
from torch import nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=96, patch_size=16, num_hiddens=512):
        super().__init__()
        def _make_tuple(x):
            if not isinstance(x, (list, tuple)):
                return (x, x)
            return x
        img_size, patch_size = _make_tuple(img_size), _make_tuple(patch_size)
        self.num_patches = (img_size[0] // patch_size[0]) * (
            img_size[1] // patch_size[1])
        self.conv = nn.LazyConv2d(num_hiddens, kernel_size=patch_size,
                                  stride=patch_size)

    def forward(self, X):
        # Output shape: (batch size, no. of patches, no. of channels)
        return self.conv(X).flatten(2).transpose(1, 2)

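# JAX (Flax). This import rebinds `nn`, so run it separately from the
# PyTorch version above.
from flax import linen as nn
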
class PatchEmbedding(nn.Module):
    img_size: int = 96
    patch_size: int = 16
    num_hiddens: int = 512

    def setup(self):
        def _make_tuple(x):
            if not isinstance(x, (list, tuple)):
                return (x, x)
            return x
        img_size, patch_size = _make_tuple(self.img_size), _make_tuple(self.patch_size)
        self.num_patches = (img_size[0] // patch_size[0]) * (
            img_size[1] // patch_size[1])
        self.conv = nn.Conv(self.num_hiddens, kernel_size=patch_size,
                            strides=patch_size, padding='SAME')

    def __call__(self, X):
        # Output shape: (batch size, no. of patches, no. of channels)
        X = self.conv(X)
        return X.reshape((X.shape[0], -1, X.shape[3]))
```

Implementation of Patch Embedding in Vision Transformers

In vision Transformers, a special learnable vector known as the class token (often denoted as the `<cls>` token) is concatenated to the sequence of patch embeddings before they are processed by the encoder. As the sequence passes through the stacked encoder blocks, self-attention allows the `<cls>` token to aggregate information from all the image patches. The final, updated state of this single token is then extracted and used as the comprehensive representation of the entire image for classification.
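
The bookkeeping around the class token can be sketched with plain Python lists. This is a minimal illustration only: the helper `prepend_cls` is hypothetical, the token is a fixed vector rather than a learned parameter, and the encoder itself is omitted.

```python
def prepend_cls(patches, cls_token):
    """Return a new sequence with the class token at position 0."""
    return [list(cls_token)] + [list(p) for p in patches]


num_patch_count, num_hiddens = 36, 4
patches = [[0.1 * i] * num_hiddens for i in range(num_patch_count)]
cls_token = [0.0] * num_hiddens  # learnable in a real model; fixed here

sequence = prepend_cls(patches, cls_token)
print(len(sequence))  # 37, one more than the number of patches

# After the encoder (omitted here), the state at position 0 is read out
# and passed to the classification head.
image_representation = sequence[0]
```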

Class Token in Vision Transformers

To implement a vision Transformer, the input image must be divided into smaller regions called patches. The process of splitting an image into patches and linearly projecting these flattened patches is known as patch embedding. This entire operation can be simplified and implemented as a single two-dimensional convolution operation, where both the kernel size and the stride size are set strictly equal to the patch size.
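
The equivalence between "split, flatten, project" and a convolution whose kernel size equals its stride can be checked on a tiny example. The sketch below assumes a single-channel 4x4 image, a patch size of 2, and no bias term; both function names are hypothetical.

```python
def patchify_and_project(image, patch, weights):
    """Split a 2D image into patches, flatten each, and apply a linear map."""
    h, w = len(image), len(image[0])
    out = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            flat = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            out.append([sum(wi * xi for wi, xi in zip(row, flat))
                        for row in weights])
    return out


def strided_conv(image, patch, kernels):
    """2D cross-correlation with kernel size = stride = patch."""
    h, w = len(image), len(image[0])
    out = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            out.append([
                sum(k[r][c] * image[top + r][left + c]
                    for r in range(patch) for c in range(patch))
                for k in kernels])
    return out


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patch = 2
weights = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]  # num_hiddens = 2
# Each row of the linear map reshaped into a 2x2 kernel
kernels = [[row[0:2], row[2:4]] for row in weights]

projected = patchify_and_project(image, patch, weights)
convolved = strided_conv(image, patch, kernels)
print(projected == convolved)  # True
```

Each output channel of the convolution corresponds to one row of the projection matrix, which is why the number of output channels plays the role of the embedding size.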


The Transformer is a deep learning architecture built exclusively on attention mechanisms, foregoing traditional recurrent or convolutional layers. A defining property of the Transformer is its superior scaling behavior: its performance consistently improves as the dataset size, model size, and computational budget increase. This architecture has become foundational, driving state-of-the-art results across natural language processing, computer vision, speech recognition, and reinforcement learning.

Transformer

Dive into Deep Learning

A self-attention layer maps an input sequence $(x_1, \ldots, x_n)$ to an output sequence of the same length, $(y_1, \ldots, y_n)$. When processing each item in the input, the model has access to all of the inputs up to and including the one under consideration, but no access to information about inputs beyond the current one.
In self-attention, the comparisons are made to other elements within the same sequence. The simplest form of comparison between elements in a self-attention layer is a dot product:
$\mathrm{score}(x_i, x_j) = x_i \cdot x_j$
The larger the value, the more similar the vectors being compared. To make effective use of these scores, we normalize them with a softmax to create a vector of weights, $\alpha_{ij}$, that indicates the proportional relevance of each input $x_j$ to the input element $x_i$ that is the current focus of attention.
$\alpha_{ij} = \frac{\exp(\mathrm{score}(x_i, x_j))}{\sum_{k=1}^{i} \exp(\mathrm{score}(x_i, x_k))}, \quad \forall j \le i$
Given the proportional scores in $\alpha$, we then generate an output value $y_i$ by taking the sum of the inputs seen so far, weighted by their respective $\alpha$ values.
$y_i = \sum_{j \le i} \alpha_{ij} x_j$
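
The three formulas above can be traced in a few lines of plain Python. This is a minimal sketch without learned projections; the function name `causal_self_attention` and the toy inputs are assumptions made for illustration, and the softmax is shifted by the maximum score purely for numerical stability.

```python
import math


def causal_self_attention(xs):
    """Compute y_i = sum_{j<=i} alpha_ij * x_j with dot-product scores."""
    ys = []
    for i, x_i in enumerate(xs):
        visible = xs[: i + 1]  # inputs up to and including position i
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in visible]
        # Softmax over the visible positions only
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Weighted sum of the visible inputs
        y_i = [sum(a * x_j[d] for a, x_j in zip(alphas, visible))
               for d in range(len(x_i))]
        ys.append(y_i)
    return ys


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = causal_self_attention(xs)
print(ys[0])  # [1.0, 0.0]: the first position can only attend to itself
```

Because each output only looks at earlier positions, changing a later input leaves all earlier outputs unchanged.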

Self-attention layers' first approach 

The Transformer model can also be used for contextual generation and text summarization.

During contextual generation, the model is given a prefix text and outputs a possible completion of it. The model has direct access to the entire prefix and to its own previously generated output.

For text summarization, the training set contains full-length articles paired with their summaries, with a unique marker separating the two parts, so that one training unit looks like $$(x_1, \ldots, x_m, \delta, y_1, \ldots, y_n)$$. Teacher forcing is also applied during training.
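
How such a training unit might be assembled, and how teacher forcing turns it into input and target sequences, is sketched below. The separator string `<sep>` stands in for the marker $\delta$ and, like both helper names, is a hypothetical choice for illustration.

```python
SEP = "<sep>"  # hypothetical separator token standing in for the marker


def make_training_unit(article_tokens, summary_tokens, sep=SEP):
    """Concatenate article, separator, and summary into one sequence."""
    return list(article_tokens) + [sep] + list(summary_tokens)


def teacher_forcing_pairs(unit):
    """Inputs drop the last token; targets are the same unit shifted by one."""
    return unit[:-1], unit[1:]


unit = make_training_unit(["the", "cat", "sat"], ["cat", "sat"])
inputs, targets = teacher_forcing_pairs(unit)

for x, y in zip(inputs, targets):
    print(f"{x!r} -> {y!r}")
```

At every step the model is fed the true previous token rather than its own prediction, so the position holding the separator is trained to predict the first summary token.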


Transformers in contextual generation and summarization


https://huggingface.co/docs/transformers/model_summary

Hugging Face Model Summary

Lin, T., Wang, Y., Liu, X., & Qiu, X. (2021). A Survey of Transformers.

A Survey of Transformers (Lin et al., 2021)

- Encoder-Decoder: the full Transformer architecture, typically used for sequence-to-sequence modeling (e.g., neural machine translation)
- Encoder Only: the outputs of the encoder are used as a representation of the input sequence. This is usually used for classification or sequence labeling problems (e.g., BERT)
- Decoder Only: the cross-attention module is removed; this is typically used for sequence generation, such as language modeling (e.g., GPT)

Model Usage of Transformers

- Multi-head self-attention: multiple attention projections are computed in parallel and then concatenated into a single $D_m$-dimensional representation

- Masked attention: self-attention modules in the decoder are adapted to prevent each position from attending to subsequent positions

- Cross-attention: in the decoder, the queries are projected from the outputs of the previous (decoder) layer, whereas the keys and values are projected using the outputs of the encoder
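
The cross-attention case can be sketched with one generic attention routine that takes queries, keys, and values separately. This is a simplified illustration: the learned projection matrices are omitted, the function name `attention` and the toy vectors are assumptions, and only a single head is shown.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_states = [[0.5, 0.5], [1.0, 0.0]]               # 2 target positions

# Cross-attention: queries from the decoder, keys and values from the encoder
cross = attention(decoder_states, encoder_outputs, encoder_outputs)
print(len(cross), len(cross[0]))  # 2 2
```

The output has one vector per decoder position, regardless of how many encoder positions there are. Passing the same sequence for all three arguments would give unmasked self-attention instead.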

Attention in vanilla Transformers

X-formers are improvements to vanilla Transformers. These variants seek improvement from the perspectives of model efficiency (decreasing memory and computational complexity), model generalization, and model adaptation.

Transformer Variants (X-formers)

The pre-training and fine-tuning paradigm is a method motivated by the goal of creating adaptable, general-purpose systems for universal language understanding and generation. It involves separating the common components of neural network architectures, such as Transformers, and training them on vast amounts of unlabeled data using self-supervision. The resulting systems, known as foundation models, can be easily adapted for specific downstream applications via fine-tuning or prompting. This paradigm shift has enormously transformed natural language processing, meaning that in many cases, large-scale supervised learning for specific tasks is no longer required.

The Pre-training and Fine-tuning Paradigm

Within Natural Language Processing, pre-trained models based on the Transformer are commonly categorized by their underlying architecture. These primary categories, which are targets for self-supervised pre-training approaches, include encoder-only, decoder-only, and encoder-decoder structures.

Architectural Categories of Pre-trained Transformers

The self-attention mechanism, a core component of the Transformer architecture, exhibits a computational complexity that scales quadratically with the length of the input sequence. This characteristic makes it prohibitively expensive and impractical to train or deploy Transformer-based models on tasks involving very long texts.
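
To make the quadratic growth concrete, the sketch below counts the pairwise attention scores for a single head in a single layer. The function name and the chosen sequence lengths are illustrative assumptions.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of query-key scores when every token attends to every token."""
    return seq_len * seq_len


for n in (512, 1024, 2048):
    print(n, attention_score_count(n))
# 512 262144
# 1024 1048576
# 2048 4194304
```

Doubling the sequence length quadruples the number of scores, which is why both compute and the memory for the attention matrix grow so quickly on long inputs.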

Computational Cost of Self-Attention in Transformers

The quadratic time complexity inherent in the self-attention mechanism causes Transformer inference to become progressively slower as sequence length increases. This performance issue is particularly pronounced for long sequences, making the standard architecture inefficient for such tasks and motivating the development of faster, more efficient models.

Quadratic Complexity's Impact on Transformer Inference Speed

In Transformer-based systems, the pre-norm architecture is a specific sub-layer configuration where layer normalization is applied internally within a residual block. Because this approach is remarkably effective at stabilizing the training of deep neural networks, it serves as the underlying structural basis for the majority of modern Large Language Models.

Pre-Norm Architecture in Transformers

The Transformer architecture processes all elements of an input sequence simultaneously by calculating interaction scores between every pair of elements. This parallel approach was a significant departure from architectures that process sequences one element at a time. Despite its advantages, this core design choice introduces a major computational limitation. Identify this limitation, explain how it stems directly from the pairwise calculation method, and describe a specific type of task where this limitation would pose a significant challenge.

Critique of the Transformer Architecture's Core Limitation

A research team is building a model to summarize extremely long scientific papers. They are comparing two distinct architectural approaches: 

*   **Approach 1:** Processes the input text sequentially, token by token, updating an internal state that is passed from one step to the next. 
*   **Approach 2:** Processes all input tokens simultaneously, using a mechanism that directly relates every token to every other token in the input to determine context. 

Which of the following statements best 

A machine translation startup is evaluating two architectural proposals for their new service. Based on the core principles of the Transformer architecture, identify which proposal aligns with its design and explain the fundamental difference in how the two proposals process input information.

Architectural Design Choice for Machine Translation

The advent of neural sequence architectures, specifically Transformers, combined with advancements in large-scale self-supervised learning, has made it possible to achieve universal capabilities in both language understanding and language generation.

Enablers of Universal Language Capabilities

The expressive power of Transformer networks can be effectively enhanced by increasing the model depth, denoted by $$L$$, which represents the total number of stacked processing layers. In standard BERT architectures, the depth $$L$$ is typically configured to either 12 or 24. However, employing networks with even greater depth is a viable strategy to achieve further performance enhancements.

Model Depth in Transformers

Alongside the rise of the Transformer architecture, the concept of language modeling was generalized to encompass models that learn to predict words in various ways, rather than strictly predicting the next token in a sequence. Many powerful Transformer-based models were pre-trained using these diverse word prediction tasks and successfully applied to a wide variety of downstream tasks.

Generalization of the Language Modeling Concept

The primary structure of a Transformer model consists of a stack of Transformer blocks, also referred to as layers. Each individual block is constructed with two stacked sub-layers: one dedicated to self-attention modeling and another for Feed-Forward Network (FFN) modeling. The internal structure of these sub-layers can be implemented using different normalization designs, such as the pre-norm architecture or the post-norm architecture, which is defined mathematically as $$\mathrm{output} = \mathrm{LNorm}(F(\mathrm{input}) + \mathrm{input})$$.
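
The difference between the two normalization designs comes down to where the layer normalization sits relative to the residual addition. The sketch below contrasts the post-norm formula above with the commonly used pre-norm form, $$\mathrm{output} = F(\mathrm{LNorm}(\mathrm{input})) + \mathrm{input}$$. It is a simplified illustration: `layer_norm` omits the learnable gain and bias, and `toy_sublayer` is a hypothetical stand-in for self-attention or the FFN.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm(x, f):
    """output = LNorm(F(input) + input)"""
    return layer_norm([a + b for a, b in zip(f(x), x)])


def pre_norm(x, f):
    """output = F(LNorm(input)) + input"""
    return [a + b for a, b in zip(f(layer_norm(x)), x)]


def toy_sublayer(x):
    # Hypothetical stand-in for self-attention or the FFN
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
print(post_norm(x, toy_sublayer))
print(pre_norm(x, toy_sublayer))
```

In the pre-norm variant the input is added back without being normalized, so an identity path runs through the whole stack of blocks. In the post-norm variant every block output is renormalized.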

Transformer Block Sub-Layers

The training of Transformer-based language models is generally formulated as a standard neural network optimization task. The goal is to find the optimal model parameters $$\hat{\theta}$$ by maximizing a likelihood-based objective function over a dataset $$\mathcal{D}$$, mathematically expressed as $$\hat{\theta} = \operatorname*{arg\,max}_{\theta} \sum_{\mathbf{x} \in \mathcal{D}} \mathcal{L}_{\theta}(\mathbf{x})$$. This optimization process is typically implemented using gradient descent algorithms, which are well-supported by standard deep learning toolkits.

Standard Optimization Objective for Transformer Language Models

When trained on massive datasets, such as those with hundreds of millions of images, Vision Transformers scale better than convolutional architectures like ResNets. In these large-scale scenarios, Vision Transformers outperform ResNets by a significant margin in image classification, suggesting that scalability and model capacity can outweigh the benefit of built-in spatial inductive biases.

Scalability in Vision Transformers

The Transformer is an instance of the encoder-decoder architecture that fundamentally relies on self-attention. Unlike attention mechanisms used in standard sequence-to-sequence learning, the Transformer adds positional encoding to both the input (source) and output (target) sequence embeddings before feeding them into the encoder and decoder, respectively.

Transformer Architecture Overview

Patch Embedding in Vision Transformers

Decoder-only Transformers modify the original sequence-to-sequence Transformer architecture by completely removing the encoder component as well as the decoder sublayer responsible for encoder-decoder cross-attention. This streamlined architecture has become the de facto standard for large-scale language modeling, as it can effectively leverage vast amounts of unlabeled text corpora via self-supervised learning.

Decoder-Only Transformer Architecture

Parti is an all-Transformer text-to-image model that demonstrates the potential for Transformer scalability across different modalities. Research indicates that a larger Parti model with more parameters is more capable of generating high-fidelity images and understanding content-rich text, similar to the scalability observed in text-only models.

Parti

A text-to-image model is a multimodal system designed to generate images based on textual descriptions. These models synthesize high-fidelity images by leveraging shared embeddings across text and vision modalities or by utilizing all-Transformer architectures. As these models scale in size, they demonstrate an increased capacity for content-rich text understanding and more accurate visual generation.
