Because self-attention has no built-in notion of token order (permuting the input tokens merely permutes the outputs in the same way), vision Transformers must explicitly incorporate spatial information. After the class token is concatenated with the sequence of patch embeddings, learnable positional embeddings are added to the token representations. Unlike the fixed sine and cosine encodings used in the original Transformer, these positional embeddings are learned parameters. They are summed directly with the input tokens, dropout is applied, and the resulting sequence is fed into the encoder.
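The addition step can be sketched with plain NumPy arrays to make the shapes concrete. This is only an illustration under stated assumptions: the sizes are arbitrary, the arrays stand in for parameters that a real model would learn, the class token is assumed to sit at the front of the sequence, and dropout is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_patches, num_hiddens = 2, 36, 512  # illustrative sizes

# Stand-ins for tensors that a real model would compute or learn.
patch_tokens = rng.normal(size=(batch_size, num_patches, num_hiddens))
cls_token = np.zeros((1, 1, num_hiddens))  # learnable in practice
pos_embedding = rng.normal(size=(1, num_patches + 1, num_hiddens))  # learnable in practice

# Place the class token in front of the patch tokens of every example
# (front placement is an assumption of this sketch).
cls = np.broadcast_to(cls_token, (batch_size, 1, num_hiddens))
tokens = np.concatenate([cls, patch_tokens], axis=1)

# One embedding vector per position, broadcast over the batch and summed.
encoder_input = tokens + pos_embedding
print(encoder_input.shape)  # (2, 37, 512)
```

Because the positional embedding has a leading dimension of 1, the same table of position vectors is shared by every image in the batch.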

Positional Embeddings in Vision Transformers

In vision Transformers, a special learnable vector known as the class token (often written as the `<cls>` token) is concatenated to the sequence of patch embeddings before the encoder processes them. As the sequence passes through the stacked encoder blocks, self-attention allows the `<cls>` token to aggregate information from all the image patches. The final state of this single token is then extracted and used as the representation of the entire image for classification.
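A rough NumPy sketch of this aggregation and readout follows. It assumes a single, simplified attention step without learned query, key, or value projections, a class token at position 0, and a hypothetical linear head `W_head`; a real encoder stacks many full Transformer blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, num_hiddens, num_classes = 36, 64, 10  # illustrative sizes

# Token sequence for one image; position 0 is assumed to be the class token.
tokens = rng.normal(size=(1 + num_patches, num_hiddens))

# One simplified self-attention step (no learned projections).
scores = tokens @ tokens.T / np.sqrt(num_hiddens)
scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
updated = weights @ tokens

# The class token's new state is a weighted average over every token.
cls_state = updated[0]

# Hypothetical linear classification head, applied to the class token only.
W_head = rng.normal(size=(num_hiddens, num_classes))
logits = cls_state @ W_head
print(weights[0].shape, logits.shape)  # (37,) (10,)
```

Row 0 of `weights` assigns a nonzero weight to every token, which is how the class token can draw on all patches at once.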

Claude

To implement a vision Transformer, the input image must be divided into smaller regions called patches. The process of splitting an image into patches and linearly projecting the flattened patches is known as patch embedding. The whole operation can be implemented as a single two-dimensional convolution in which both the kernel size and the stride are set equal to the patch size, so that each application of the kernel covers exactly one non-overlapping patch.
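The equivalence between "split, flatten, project" and a strided convolution can be checked numerically. The sketch below uses NumPy with small arbitrary sizes, omits the bias term, and writes the convolution as an explicit loop for clarity; it is an illustration, not how a framework computes it.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, img_size, patch_size, num_hiddens = 3, 8, 4, 5  # small illustrative sizes
image = rng.normal(size=(channels, img_size, img_size))
kernel = rng.normal(size=(num_hiddens, channels, patch_size, patch_size))
n = img_size // patch_size  # patches per side

# View 1: cut the image into non-overlapping patches, flatten, project linearly.
patches = image.reshape(channels, n, patch_size, n, patch_size)
patches = patches.transpose(1, 3, 0, 2, 4).reshape(n * n, -1)  # (patches, C*p*p)
projected = patches @ kernel.reshape(num_hiddens, -1).T        # (patches, num_hiddens)

# View 2: slide the kernel with stride equal to the kernel size.
conv_out = np.empty((n * n, num_hiddens))
for i in range(n):
    for j in range(n):
        window = image[:, i * patch_size:(i + 1) * patch_size,
                       j * patch_size:(j + 1) * patch_size]
        conv_out[i * n + j] = (kernel * window).sum(axis=(1, 2, 3))

print(np.allclose(projected, conv_out))  # True
```

Each output channel of the convolution plays the role of one row of the linear projection matrix.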

Patch Embedding in Vision Transformers

Dive into Deep Learning

When applying a patch embedding operation to an input image with a height and width of $$\text{img\_size}$$, using a specific $$\text{patch\_size}$$, the resulting sequence will contain $$(\text{img\_size} // \text{patch\_size})^2$$ patches. Each of these patches is then linearly projected into a vector of a fixed length, commonly denoted as $$\text{num\_hiddens}$$.
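As a quick arithmetic check, the formula can be written as a small helper. The function name `patch_embedding_shape` is hypothetical and exists only for this illustration; it assumes a square image and square patches.

```python
def patch_embedding_shape(batch_size, img_size, patch_size, num_hiddens):
    """Expected output shape of patch embedding for square images and patches."""
    per_side = img_size // patch_size  # patches along each spatial dimension
    return (batch_size, per_side ** 2, num_hiddens)

# 96 // 16 = 6 patches per side, so 6 * 6 = 36 patches in total.
print(patch_embedding_shape(4, 96, 16, 512))   # (4, 36, 512)
# 224 // 16 = 14 patches per side, so 14 * 14 = 196 patches in total.
print(patch_embedding_shape(1, 224, 16, 768))  # (1, 196, 768)
```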

Output Shape of Patch Embedding in Vision Transformers

In deep learning frameworks, patch embedding can be implemented as a neural network module. The core mechanism is a 2D convolution layer where both the kernel size and stride are set to the desired patch size. The output of the convolution is then flattened spatially and transposed to produce a sequence of patch representations.

```python
# PyTorch
from torch import nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=96, patch_size=16, num_hiddens=512):
        super().__init__()
        def _make_tuple(x):
            if not isinstance(x, (list, tuple)):
                return (x, x)
            return x
        img_size, patch_size = _make_tuple(img_size), _make_tuple(patch_size)
        self.num_patches = (img_size[0] // patch_size[0]) * (
            img_size[1] // patch_size[1])
        self.conv = nn.LazyConv2d(num_hiddens, kernel_size=patch_size,
                                  stride=patch_size)

    def forward(self, X):
        # X is channels-first: (batch size, no. of channels, height, width)
        # Output shape: (batch size, no. of patches, no. of channels)
        return self.conv(X).flatten(2).transpose(1, 2)

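# JAX (Flax). Keep this in a separate module from the PyTorch version,
# since both bind the name `nn` and define `PatchEmbedding`.
from flax import linen as nn
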
class PatchEmbedding(nn.Module):
    img_size: int = 96
    patch_size: int = 16
    num_hiddens: int = 512

    def setup(self):
        def _make_tuple(x):
            if not isinstance(x, (list, tuple)):
                return (x, x)
            return x
        img_size, patch_size = _make_tuple(self.img_size), _make_tuple(self.patch_size)
        self.num_patches = (img_size[0] // patch_size[0]) * (
            img_size[1] // patch_size[1])
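        # 'VALID' adds no padding, so the patch count matches num_patches even
        # when the image size is not a multiple of the patch size
        self.conv = nn.Conv(self.num_hiddens, kernel_size=patch_size,
                            strides=patch_size, padding='VALID')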

    def __call__(self, X):
        # X is channels-last: (batch size, height, width, no. of channels)
        # Output shape: (batch size, no. of patches, no. of channels)
        X = self.conv(X)
        return X.reshape((X.shape[0], -1, X.shape[3]))
```
