Concept
Positional Embeddings in Vision Transformers
Self-attention has no built-in notion of token order: permuting the input tokens simply permutes the outputs in the same way (permutation equivariance). Vision Transformers must therefore inject spatial information explicitly. After the class token is concatenated with the sequence of patch embeddings, learnable positional embeddings, one per token including the class token, are added elementwise to the token representations. Unlike the fixed sine and cosine encodings used in the original Transformer, these embeddings are model parameters trained together with the rest of the network. Dropout is applied to the sum, and the resulting sequence is fed into the encoder.
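The order of operations described above (concatenate the class token, add positional embeddings, apply dropout) can be sketched in a few lines. This is a minimal NumPy illustration, not the D2L implementation: the function name `add_positional_embeddings`, the tensor sizes, the dropout rate, the zero-initialized class token, and the hand-written inverted dropout are all assumptions made for the example, and the "learned" parameters are just randomly initialized arrays that a real model would update during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
batch_size, num_patches, num_hiddens = 2, 4, 8
dropout_rate = 0.1

# Stand-in for the output of a patch-embedding layer.
patch_embeddings = rng.normal(size=(batch_size, num_patches, num_hiddens))

# Parameters that a real model would learn; here they are only initialized.
cls_token = np.zeros((1, 1, num_hiddens))
# One positional embedding per token: num_patches patches plus the class token.
pos_embedding = rng.normal(size=(1, num_patches + 1, num_hiddens))


def add_positional_embeddings(patches, cls_token, pos_embedding,
                              dropout_rate, rng, training=True):
    """Prepend the class token, add positional embeddings, apply dropout."""
    batch, _, hidden = patches.shape
    # Repeat the single class token for every example in the batch.
    cls = np.broadcast_to(cls_token, (batch, 1, hidden))
    tokens = np.concatenate([cls, patches], axis=1)
    # Elementwise sum; pos_embedding broadcasts over the batch dimension.
    tokens = tokens + pos_embedding
    if training and dropout_rate > 0:
        # Inverted dropout: zero out entries, rescale the survivors.
        keep = rng.random(tokens.shape) >= dropout_rate
        tokens = tokens * keep / (1.0 - dropout_rate)
    return tokens


# Training-mode input to the encoder (dropout active).
encoder_input = add_positional_embeddings(
    patch_embeddings, cls_token, pos_embedding, dropout_rate, rng)

# Evaluation-mode input (dropout disabled), useful for inspecting the sum.
eval_input = add_positional_embeddings(
    patch_embeddings, cls_token, pos_embedding, dropout_rate, rng,
    training=False)

print(encoder_input.shape)
```

Because the positional embedding has a leading dimension of 1, the same set of position vectors is shared by every image in the batch; only the patch content differs between examples.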
Updated 2026-05-15
Tags
D2L
Dive into Deep Learning @ D2L