Concept

Computational Complexity of Vision Transformers on High-Resolution Images

The standard Vision Transformer (ViT) is poorly suited to high-resolution images because the cost of its self-attention mechanism grows quadratically with sequence length. For a fixed patch size, the number of flattened patches grows in proportion to the image area, so doubling both the height and the width quadruples the sequence length and makes the attention computation roughly 16 times more expensive. At high resolutions this cost quickly becomes prohibitive.
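As a rough illustration of this scaling, the sketch below counts the patches and the entries of the attention score matrix for a few square image sizes. The patch size of 16 and the resolutions are assumptions chosen for the example, the function names are hypothetical, and the count ignores any class token, attention heads, and the embedding dimension.

```python
def num_patches(height, width, patch_size=16):
    """Sequence length: number of non-overlapping patches (no class token)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_matrix_entries(height, width, patch_size=16):
    """Entries in one n x n attention score matrix (one head, one layer)."""
    n = num_patches(height, width, patch_size)
    return n * n


if __name__ == "__main__":
    # Assumed example resolutions; each step doubles the side length.
    for side in (224, 448, 896):
        n = num_patches(side, side)
        entries = attention_matrix_entries(side, side)
        print(f"{side}x{side}: {n} patches, {entries} attention entries")
```

Each doubling of the side length multiplies the patch count by 4 and the number of attention entries by 16, which is the quadratic growth described above.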

Updated 2026-05-15

Tags

D2L

Dive into Deep Learning @ D2L
