Learn Before
Activity (Process)
Forward Pass of Vision Transformers
The forward pass of a vision Transformer chains its core components in a fixed order. First, a patch embedding module splits each input image into patches and maps every patch to an embedding vector, producing a sequence of patch embeddings. A learnable <cls> token is then concatenated to the front of this sequence, which lengthens it by one position. Next, learnable positional embeddings are added to the sequence, followed by dropout. The sequence is then fed into a Transformer encoder consisting of a stack of encoder blocks, each of which returns a sequence of the same shape as its input. Finally, the network takes the encoded representation of the <cls> token alone and passes it through a network head to produce the final output, for example class logits in image classification.
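The order of these steps is easiest to follow by tracking how the tensor shape changes from stage to stage. The following is a minimal sketch in plain Python, not an actual model: the function name `vit_forward_shapes`, its parameter names, and the example sizes are assumptions chosen for illustration, and square images with square patches are assumed.

```python
def vit_forward_shapes(batch_size, img_size, patch_size, num_hiddens,
                       num_blks, num_classes):
    """Trace the tensor shape after each stage of a ViT forward pass.

    Hypothetical helper for illustration only; no computation is performed.
    """
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    num_patches = (img_size // patch_size) ** 2
    stages = []

    # 1. Patch embedding: each patch becomes one num_hiddens-dimensional token.
    shape = (batch_size, num_patches, num_hiddens)
    stages.append(("patch_embedding", shape))

    # 2. Concatenate the learnable <cls> token at position 0,
    #    so the sequence grows by one token.
    shape = (batch_size, num_patches + 1, num_hiddens)
    stages.append(("concat_cls_token", shape))

    # 3. Add positional embeddings of shape (1, num_patches + 1, num_hiddens),
    #    broadcast over the batch, then apply dropout. The shape is unchanged.
    stages.append(("add_pos_embedding_and_dropout", shape))

    # 4. Each encoder block maps a sequence to a sequence of the same shape.
    for i in range(num_blks):
        stages.append((f"encoder_block_{i}", shape))

    # 5. Keep only the <cls> position (index 0), then apply the head.
    stages.append(("select_cls_token", (batch_size, num_hiddens)))
    stages.append(("head", (batch_size, num_classes)))
    return stages


# Example sizes are illustrative assumptions, not values from the text.
for name, shape in vit_forward_shapes(batch_size=2, img_size=96, patch_size=16,
                                      num_hiddens=512, num_blks=2,
                                      num_classes=10):
    print(f"{name:32s}{shape}")
```

With these example sizes, a 96x96 image yields 36 patches, so the encoder operates on 37 tokens per image, and only the first of them reaches the head.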
Updated 2026-05-15
Tags
D2L
Dive into Deep Learning @ D2L