To verify the output shape of a patch embedding, an instance of the embedding module can be applied to a dummy input tensor. For example, given images with a height and width of 96, a patch size of 16, a batch size of 4, and a hidden vector length of 512, the expected output shape is $$(4, (96 // 16)^2, 512)$$, which simplifies to $$(4, 36, 512)$$.

```python
# PyTorch
img_size, patch_size, num_hiddens, batch_size = 96, 16, 512, 4
patch_emb = PatchEmbedding(img_size, patch_size, num_hiddens)
X = torch.zeros(batch_size, 3, img_size, img_size)
d2l.check_shape(patch_emb(X), (batch_size, (img_size // patch_size)**2, num_hiddens))

# JAX
img_size, patch_size, num_hiddens, batch_size = 96, 16, 512, 4
patch_emb = PatchEmbedding(img_size, patch_size, num_hiddens)
X = jnp.zeros((batch_size, img_size, img_size, 3))
output, _ = patch_emb.init_with_output(d2l.get_key(), X)
d2l.check_shape(output, (batch_size, (img_size // patch_size)**2, num_hiddens))
```

Example of Patch Embedding Output Shape in Vision Transformers

When applying a patch embedding operation to an input image with a height and width of $$	ext{img\_size}$$, using a specific $$	ext{patch\_size}$$, the resulting sequence will contain $$(	ext{img\_size} // 	ext{patch\_size})^2$$ patches. Each of these patches is then linearly projected into a vector of a fixed length, commonly denoted as $$	ext{num\_hiddens}$$.

Claude

To implement a vision Transformer, the input image must be divided into smaller regions called patches. The process of splitting an image into patches and linearly projecting these flattened patches is known as patch embedding. This entire operation can be simplified and implemented as a single two-dimensional convolution operation, where both the kernel size and the stride size are set strictly equal to the patch size.

Patch Embedding in Vision Transformers

Dive into Deep Learning

Output Shape of Patch Embedding in Vision Transformers

In deep learning frameworks, patch embedding can be implemented as a neural network module. The core mechanism is a 2D convolution layer where both the kernel size and stride are set to the desired patch size. The output of the convolution is then flattened spatially and transposed to produce a sequence of patch representations.

```python
# PyTorch
class PatchEmbedding(nn.Module):
    def __init__(self, img_size=96, patch_size=16, num_hiddens=512):
        super().__init__()
        def _make_tuple(x):
            if not isinstance(x, (list, tuple)):
                return (x, x)
            return x
        img_size, patch_size = _make_tuple(img_size), _make_tuple(patch_size)
        self.num_patches = (img_size[0] // patch_size[0]) * (
            img_size[1] // patch_size[1])
        self.conv = nn.LazyConv2d(num_hiddens, kernel_size=patch_size,
                                  stride=patch_size)

    def forward(self, X):
        # Output shape: (batch size, no. of patches, no. of channels)
        return self.conv(X).flatten(2).transpose(1, 2)

# JAX
class PatchEmbedding(nn.Module):
    img_size: int = 96
    patch_size: int = 16
    num_hiddens: int = 512

    def setup(self):
        def _make_tuple(x):
            if not isinstance(x, (list, tuple)):
                return (x, x)
            return x
        img_size, patch_size = _make_tuple(self.img_size), _make_tuple(self.patch_size)
        self.num_patches = (img_size[0] // patch_size[0]) * (
            img_size[1] // patch_size[1])
        self.conv = nn.Conv(self.num_hiddens, kernel_size=patch_size,
                            strides=patch_size, padding='SAME')

    def __call__(self, X):
        # Output shape: (batch size, no. of patches, no. of channels)
        X = self.conv(X)
        return X.reshape((X.shape[0], -1, X.shape[3]))
```

Implementation of Patch Embedding in Vision Transformers

In vision Transformers, a special learnable vector known as the class token (often denoted as the `<cls>` token) is concatenated to the sequence of patch embeddings before they are processed by the encoder. As the sequence passes through the stacked encoder blocks, self-attention allows the `<cls>` token to aggregate information from all the image patches. The final, updated state of this single token is then extracted and used as the comprehensive representation of the entire image for classification.

Learn Before

Related

Learn After