Concept
Network Head in Vision Transformers
The network head is the final decision-making component of a vision Transformer architecture. After the stack of Transformer encoder blocks has processed the full token sequence, the network keeps only the output representation of the <cls> token; the outputs for the remaining tokens are not passed to the head. The head, typically implemented as layer normalization followed by a linear layer, projects this representation to the final outputs, such as class predictions.
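The head described above can be sketched in a few lines. This is a minimal NumPy illustration, not the reference implementation: the function names `layer_norm` and `vit_head`, the tensor shapes, and the random parameters are all hypothetical, and the <cls> token is assumed to sit at index 0 of the token sequence.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def vit_head(encoder_out, gamma, beta, weight, bias):
    """Map encoder outputs of shape (batch, num_tokens, hidden) to class scores.

    Assumption: the <cls> token is the first token in the sequence.
    """
    cls_repr = encoder_out[:, 0]                # keep only the <cls> output: (batch, hidden)
    normed = layer_norm(cls_repr, gamma, beta)  # layer normalization
    return normed @ weight + bias               # linear layer: (batch, num_classes)

# Illustrative sizes and randomly initialized parameters (assumed values).
rng = np.random.default_rng(0)
batch, num_tokens, hidden, num_classes = 2, 5, 8, 10
encoder_out = rng.normal(size=(batch, num_tokens, hidden))
gamma, beta = np.ones(hidden), np.zeros(hidden)
weight = rng.normal(size=(hidden, num_classes)) * 0.02
bias = np.zeros(num_classes)

logits = vit_head(encoder_out, gamma, beta, weight, bias)
print(logits.shape)  # (2, 10)
```

Because the head reads only the first token, changing the encoder outputs for any other token leaves the scores unchanged. Those tokens influence the prediction only through self-attention inside the encoder blocks.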
Updated 2026-05-15
Tags
D2L
Dive into Deep Learning @ D2L