1Cademy - Vision Transformer Encoder Block

Learn Before

Pre-Norm Architecture in Transformers

Concept

Vision Transformer Encoder Block

The vision Transformer encoder block processes a sequence of image patch embeddings and uses a pre-normalization design. Within the block, layer normalization is applied immediately before both the multi-head attention and the multilayer perceptron (MLP), rather than after each residual addition as in the post-normalization design of the original Transformer. Pre-normalization generally makes Transformer training more effective and efficient than post-normalization. Like a standard Transformer block, a vision Transformer encoder block does not change the shape of its input: the output has the same shape as the input.
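The pre-normalization ordering described above can be illustrated with a minimal PyTorch sketch. This is not the reference implementation: the class name `ViTEncoderBlock`, the parameter names, the GELU activation, the dropout placement, and the example sizes are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class ViTEncoderBlock(nn.Module):
    """Illustrative pre-norm block: LayerNorm precedes attention and the MLP."""

    def __init__(self, num_hiddens, mlp_num_hiddens, num_heads, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(num_hiddens)
        self.attention = nn.MultiheadAttention(
            num_hiddens, num_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(num_hiddens)
        # Assumed MLP layout: expand, GELU, project back to num_hiddens.
        self.mlp = nn.Sequential(
            nn.Linear(num_hiddens, mlp_num_hiddens),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_num_hiddens, num_hiddens),
            nn.Dropout(dropout),
        )

    def forward(self, X):
        # Normalize first, then self-attend; the residual path carries
        # the un-normalized input.
        Y = self.ln1(X)
        attn_out, _ = self.attention(Y, Y, Y, need_weights=False)
        X = X + attn_out
        # Same pattern for the MLP sublayer.
        return X + self.mlp(self.ln2(X))


# Hypothetical sizes: batch of 2, 100 patches, 24-dimensional embeddings.
X = torch.ones((2, 100, 24))
block = ViTEncoderBlock(num_hiddens=24, mlp_num_hiddens=48, num_heads=8)
block.eval()  # disable dropout for a deterministic forward pass
out = block(X)
print(out.shape)  # torch.Size([2, 100, 24])
```

Because both sublayers map `num_hiddens` back to `num_hiddens` and are added to the input through residual connections, the output shape matches the input shape regardless of the MLP's hidden width.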

Updated 2026-05-15

References

Dive into Deep Learning

Learn After

Implementation of the Vision Transformer Encoder Block
Output Shape of the Vision Transformer Encoder Block
Network Head in Vision Transformers
