Activity (Process)

Forward Pass of Vision Transformers

The forward pass of a vision Transformer chains its core components in sequence. First, a patch embedding module splits each input image into patches and projects them into a sequence of embeddings, and a learnable <cls> token is prepended to that sequence. Learnable positional embeddings are then added to the sequence, followed by dropout. The result is fed into a Transformer encoder, which is a stack of encoder blocks. Finally, the network takes the encoder's output representation of the <cls> token and passes it through a network head to generate the final projection.
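The steps above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the class name `TinyViT`, every hyperparameter value, the use of a strided `nn.Conv2d` as the patch embedding, the use of `nn.TransformerEncoderLayer` as a stand-in for the encoder blocks, and the `LayerNorm` + `Linear` head producing class logits are all choices made here for illustration.

```python
import torch
from torch import nn


class TinyViT(nn.Module):
    """Minimal vision Transformer sketch; all sizes are illustrative."""

    def __init__(self, img_size=32, patch_size=8, in_channels=3,
                 num_hiddens=64, num_heads=4, mlp_hiddens=128,
                 num_blocks=2, num_classes=10, dropout=0.1):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding (assumed form): a strided convolution cuts the
        # image into non-overlapping patches and projects each one.
        self.patch_embedding = nn.Conv2d(in_channels, num_hiddens,
                                         kernel_size=patch_size,
                                         stride=patch_size)
        # Learnable <cls> token, shared across the batch.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, num_hiddens))
        # One learnable positional embedding per patch, plus one for <cls>.
        self.pos_embedding = nn.Parameter(
            torch.randn(1, num_patches + 1, num_hiddens))
        self.dropout = nn.Dropout(dropout)
        # Stand-in encoder block from PyTorch (pre-norm, GELU MLP).
        block = nn.TransformerEncoderLayer(
            d_model=num_hiddens, nhead=num_heads,
            dim_feedforward=mlp_hiddens, dropout=dropout,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=num_blocks)
        # Network head (assumed form): normalize, then map to class logits.
        self.head = nn.Sequential(nn.LayerNorm(num_hiddens),
                                  nn.Linear(num_hiddens, num_classes))

    def forward(self, X):
        # (batch, C, H, W) -> (batch, num_patches, num_hiddens)
        X = self.patch_embedding(X).flatten(2).transpose(1, 2)
        # Prepend the <cls> token to every sequence in the batch.
        cls = self.cls_token.expand(X.shape[0], -1, -1)
        X = torch.cat((cls, X), dim=1)
        # Add positional embeddings, then apply dropout.
        X = self.dropout(X + self.pos_embedding)
        # Run the stack of encoder blocks.
        X = self.encoder(X)
        # Keep only the <cls> representation and pass it to the head.
        return self.head(X[:, 0])


model = TinyViT().eval()  # eval() disables dropout for this check
images = torch.randn(2, 3, 32, 32)
with torch.no_grad():
    logits = model(images)
print(logits.shape)  # torch.Size([2, 10])
```

With 32x32 images and 8x8 patches there are 16 patches, so the encoder sees 17 tokens per image once the <cls> token is prepended. Only the first token's output reaches the head.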

Updated 2026-05-15

Tags

D2L

Dive into Deep Learning @ D2L