The forward pass of a vision Transformer chains its core components in a fixed order. First, input images are processed by a patch embedding module, and the resulting sequence of patch embeddings is concatenated with a learnable `<cls>` token. Next, learnable positional embeddings are added to the sequence, followed by dropout. The sequence is then fed into a Transformer encoder, which is a stack of encoder blocks. Finally, the network extracts the encoder's output representation of the `<cls>` token and passes it through a network head to produce the final output, such as class scores.
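The order of operations described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the names `patch_embed`, `encoder_block`, `vit_forward`, and `init_params` are hypothetical, the encoder block uses single-head attention and a ReLU MLP for brevity, the `<cls>` token is placed at index 0, and the head is assumed to be a layer normalization followed by a linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def patch_embed(images, patch_size, W_patch):
    """Cut (B, C, H, W) images into flattened patches and project them linearly."""
    B, C, H, W = images.shape
    p = patch_size
    x = images.reshape(B, C, H // p, p, W // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)
    return x @ W_patch  # (B, num_patches, d)

def dropout(x, rate, training):
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def encoder_block(x, p):
    """Simplified pre-norm block: single-head self-attention, then a ReLU MLP."""
    h = layer_norm(x)
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    x = x + attn @ v                      # residual connection around attention
    h = layer_norm(x)
    return x + np.maximum(h @ p["W1"], 0.0) @ p["W2"]  # residual around the MLP

def vit_forward(images, params, patch_size, drop_rate=0.1, training=False):
    x = patch_embed(images, patch_size, params["W_patch"])       # 1. patch embedding
    cls = np.broadcast_to(params["cls_token"], (x.shape[0], 1, x.shape[-1]))
    x = np.concatenate([cls, x], axis=1)                         # 2. add <cls> token
    x = x + params["pos_embedding"]                              # 3. positional embeddings
    x = dropout(x, drop_rate, training)                          # 4. dropout
    for block in params["blocks"]:                               # 5. encoder stack
        x = encoder_block(x, block)
    return layer_norm(x[:, 0]) @ params["W_head"]                # 6. head on <cls> only

def init_params(img_size=32, patch_size=8, channels=3, d=16, d_mlp=32,
                num_blocks=2, num_classes=10):
    n = (img_size // patch_size) ** 2
    w = lambda *shape: rng.normal(0.0, 0.02, shape)
    blocks = [{"Wq": w(d, d), "Wk": w(d, d), "Wv": w(d, d),
               "W1": w(d, d_mlp), "W2": w(d_mlp, d)} for _ in range(num_blocks)]
    return {"W_patch": w(channels * patch_size ** 2, d),
            "cls_token": np.zeros((1, 1, d)),
            "pos_embedding": w(1, n + 1, d),
            "blocks": blocks,
            "W_head": w(d, num_classes)}

params = init_params()
images = rng.normal(size=(4, 3, 32, 32))
logits = vit_forward(images, params, patch_size=8)
print(logits.shape)
```

With these toy sizes, a batch of four 32×32 images becomes 16 patch tokens plus one `<cls>` token per image, and only the `<cls>` row of the encoder output reaches the head.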

Forward Pass of Vision Transformers

Because self-attention has no built-in notion of token order (permuting the input tokens simply permutes the outputs in the same way), vision Transformers must explicitly incorporate spatial information. After the sequence of patch embeddings is concatenated with the class token, learnable positional embeddings are added to the token representations. Unlike the fixed sinusoidal positional encodings used in the original Transformer, these positional embeddings are learned parameters. They are summed element-wise with the input tokens before dropout is applied and the sequence is fed into the encoder.
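The need for positional information can be checked numerically. The sketch below is only an illustration with randomly initialized weights and a hypothetical `self_attention` helper: without positional embeddings, shuffling the tokens just shuffles the attention output, so the layer cannot tell where a patch came from. Once a position-specific vector is added to each slot, the same shuffle changes the result.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a (num_tokens, d) sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

n, d = 5, 8
x = rng.normal(size=(n, d))                      # stand-in for patch embeddings
Wq, Wk, Wv = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
perm = np.array([2, 0, 4, 1, 3])                 # a fixed shuffle of the tokens

# No positional embeddings: shuffled input gives the same output, shuffled.
plain = self_attention(x, Wq, Wk, Wv)
plain_shuffled = self_attention(x[perm], Wq, Wk, Wv)
order_blind = np.allclose(plain_shuffled, plain[perm])

# With positional embeddings (random here, learned in a real model):
# each slot gets its own vector, so the shuffle is no longer harmless.
pos = rng.normal(size=(n, d))
with_pos = self_attention(x + pos, Wq, Wk, Wv)
with_pos_shuffled = self_attention(x[perm] + pos, Wq, Wk, Wv)
order_aware = not np.allclose(with_pos_shuffled, with_pos[perm])

print(order_blind, order_aware)
```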

Claude

In vision Transformers, a special learnable vector known as the class token (often written as the `<cls>` token) is prepended to the sequence of patch embeddings before the encoder processes them. As the sequence passes through the stacked encoder blocks, self-attention lets the `<cls>` token aggregate information from all the image patches. The final state of this single token is then extracted and used as the representation of the entire image for classification.
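As a small illustration of this aggregation, the sketch below runs one attention layer with random weights over a `<cls>` token followed by four patch embeddings. The helper `attend` and all sizes are hypothetical, and placing `<cls>` at index 0 is an assumption that follows the common convention. The `<cls>` row of the attention matrix assigns a weight to every token, so changing any single patch changes the extracted image representation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_patches = 8, 4

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(tokens, Wq, Wk, Wv):
    """Return the attention output and the attention weight matrix."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

Wq, Wk, Wv = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
cls_token = rng.normal(size=(1, d))              # a learned parameter in a real model
patches = rng.normal(size=(num_patches, d))      # stand-in for patch embeddings

tokens = np.concatenate([cls_token, patches], axis=0)   # <cls> sits at index 0
out, weights = attend(tokens, Wq, Wk, Wv)
cls_weights = weights[0]     # how strongly <cls> attends to itself and each patch
image_repr = out[0]          # the vector that would be handed to the head

# Perturb only the last patch and recompute: the <cls> output moves too.
patches_changed = patches.copy()
patches_changed[-1] += 1.0
tokens_changed = np.concatenate([cls_token, patches_changed], axis=0)
image_repr_changed = attend(tokens_changed, Wq, Wk, Wv)[0][0]

print(cls_weights.round(3))
print(np.abs(image_repr - image_repr_changed).max())
```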
