Concept

Vision Transformer Encoder Block

The vision Transformer encoder block operates on a sequence of image-patch embeddings and uses a pre-normalization design: layer normalization is applied immediately before both the multi-head attention and the multilayer perceptron (MLP), and a residual connection adds each sublayer's output back to its input. Pre-normalization generally makes Transformer training more effective or efficient than the post-normalization design of the original Transformer, where normalization follows the residual addition. As in a standard Transformer encoder block, the output has exactly the same shape as the input.
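The pre-normalization data flow can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the D2L implementation: dropout and the learned layer-norm scale and shift are omitted, the weights are random, GELU uses its tanh approximation, and the names `vit_block`, `init_params`, and the parameter keys are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over its feature dimension (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU (an assumption; other activations also work).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    batch, seq_len, dim = x.shape
    head_dim = dim // num_heads

    def split_heads(t):
        # (batch, seq_len, dim) -> (batch, num_heads, seq_len, head_dim)
        return t.reshape(batch, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    out = softmax(scores) @ v
    # Merge heads back: (batch, num_heads, seq_len, head_dim) -> (batch, seq_len, dim)
    out = out.transpose(0, 2, 1, 3).reshape(batch, seq_len, dim)
    return out @ w_o

def vit_block(x, params, num_heads):
    # Pre-norm sublayer 1: normalize, attend, then add the residual.
    x = x + multi_head_attention(
        layer_norm(x), params["w_q"], params["w_k"], params["w_v"], params["w_o"], num_heads
    )
    # Pre-norm sublayer 2: normalize, apply the MLP, then add the residual.
    hidden = gelu(layer_norm(x) @ params["w1"] + params["b1"])
    return x + (hidden @ params["w2"] + params["b2"])

def init_params(dim, mlp_dim, rng):
    # Hypothetical helper: small random weights, zero biases.
    shapes = {
        "w_q": (dim, dim), "w_k": (dim, dim), "w_v": (dim, dim), "w_o": (dim, dim),
        "w1": (dim, mlp_dim), "w2": (mlp_dim, dim),
    }
    params = {name: 0.02 * rng.standard_normal(shape) for name, shape in shapes.items()}
    params["b1"] = np.zeros(mlp_dim)
    params["b2"] = np.zeros(dim)
    return params

rng = np.random.default_rng(0)
params = init_params(dim=24, mlp_dim=48, rng=rng)
x = rng.standard_normal((2, 100, 24))  # (batch, num_patches, embedding_dim)
y = vit_block(x, params, num_heads=8)
print(y.shape)  # (2, 100, 24), the same shape as the input
```

Because each sublayer's output is added back to its own input, both sublayers must map a tensor of shape (batch, sequence length, embedding dimension) to the same shape, which is why the block as a whole preserves the input shape.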

Updated 2026-05-15

Tags

D2L

Dive into Deep Learning @ D2L
