1Cademy - Computational Complexity of Vision Transformers on High-Resolution Images

Learn Before

Computational Cost of Self-Attention in Transformers

Concept

Computational Complexity of Vision Transformers on High-Resolution Images

The standard Vision Transformer (ViT) architecture is poorly suited to high-resolution images because the cost of its self-attention mechanism grows quadratically with sequence length. For a fixed patch size, the number of flattened patches grows in proportion to the image area, so doubling both the height and the width quadruples the sequence length and increases the self-attention cost roughly sixteenfold. At high resolutions, the self-attention computation becomes prohibitively expensive.
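To make the scaling concrete, the following is a minimal sketch that counts patches and pairwise attention scores for a few square image sizes. The patch size of 16 and the helper names `num_patches` and `attention_score_entries` are assumptions chosen for illustration, not part of any library; the count also ignores the class token and the embedding dimension, so it shows the growth trend rather than an exact cost.

```python
def num_patches(height: int, width: int, patch_size: int = 16) -> int:
    """Sequence length from non-overlapping square patches (class token ignored)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_score_entries(height: int, width: int, patch_size: int = 16) -> int:
    """Size of one attention score matrix: every patch attends to every patch."""
    n = num_patches(height, width, patch_size)
    return n * n


if __name__ == "__main__":
    # Assumed example resolutions; each step doubles both image sides.
    for side in (224, 448, 896):
        n = num_patches(side, side)
        entries = attention_score_entries(side, side)
        print(f"{side}x{side}: {n} patches, {entries:,} attention scores per head")
```

Under these assumptions, each doubling of the image side multiplies the number of patches by 4 and the number of attention scores by 16, which is the quadratic dependence on sequence length described above.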

Updated 2026-05-15

References

Dive into Deep Learning

Learn After

Swin Transformer
