Concept
Class Token in Vision Transformers
In vision Transformers, a special learnable vector known as the class token (often denoted as the <cls> token) is prepended to the sequence of patch embeddings before the encoder processes them, so an image split into m patches yields a sequence of m + 1 tokens. As the sequence passes through the stacked encoder blocks, self-attention lets the <cls> token attend to, and aggregate information from, every image patch. The final state of this single token is then extracted and passed to the classification head as the representation of the entire image.
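The flow described above can be illustrated with a minimal NumPy sketch. It is not a full vision Transformer: it assumes a single-head self-attention layer in place of the stacked encoder blocks, omits positional embeddings, residual connections, layer normalization, and the MLP, and uses random values where a real model would use learned parameters. All names and sizes (`patches`, `cls_token`, `w_head`, `dim = 8`, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only.
batch, num_patches, dim, num_classes = 2, 4, 8, 3

# Stand-in for the patch embeddings of a batch of images.
patches = rng.normal(size=(batch, num_patches, dim))

# The class token is one vector shared by all images
# (learnable in a real model; random here).
cls_token = rng.normal(size=(1, 1, dim))

# Prepend a copy of the class token to each sequence: m patches -> m + 1 tokens.
tokens = np.concatenate(
    [np.broadcast_to(cls_token, (batch, 1, dim)), patches], axis=1
)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the token axis."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v, weights


# Random projection matrices standing in for learned attention weights.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
encoded, attn = self_attention(tokens, w_q, w_k, w_v)

# Row 0 of the attention weights shows how the class token mixes all tokens.
cls_attention = attn[:, 0]  # shape: (batch, num_patches + 1)

# Extract the class token's output state and feed it to a linear head.
cls_state = encoded[:, 0]  # shape: (batch, dim)
w_head = rng.normal(size=(dim, num_classes))
logits = cls_state @ w_head  # shape: (batch, num_classes)

print(tokens.shape, cls_state.shape, logits.shape)
```

Only position 0 of the encoder output reaches the classification head in this sketch; the patch positions influence the prediction solely through the attention weights in `cls_attention`.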
Updated 2026-05-15
Tags
D2L
Dive into Deep Learning @ D2L