Swin Transformers were developed as a general-purpose backbone network for computer vision that addresses the quadratic computational complexity of standard self-attention with respect to image size. By reinstating convolution-like priors such as locality and hierarchical feature maps, Swin Transformers extend the applicability of the Transformer architecture beyond basic image classification, achieving state-of-the-art results across a wide range of computer vision tasks.

Claude

The standard Vision Transformer (ViT) architecture is poorly suited to high-resolution images because its global self-attention has quadratic computational complexity in the number of patches. At a fixed patch size, the number of flattened patches grows in proportion to the image's pixel count, so doubling both the height and the width quadruples the sequence length and makes the attention-matrix computation roughly sixteen times more expensive.
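The scaling argument can be made concrete with a rough cost estimate. The sketch below is illustrative only: the function names, the patch size of 16, the embedding dimension of 768, and the window size of 7 are assumptions chosen for the example, and the cost formulas are the usual approximations that count only the linear projections and the attention matrix products (softmax and other overheads are ignored). It compares global self-attention with attention restricted to fixed-size local windows, the kind of locality prior Swin Transformers rely on.

```python
def num_patches(height, width, patch_size=16):
    """Number of non-overlapping patches (tokens) for an image.

    patch_size=16 is an assumed, typical ViT setting.
    """
    return (height // patch_size) * (width // patch_size)


def global_attention_cost(n_tokens, dim):
    """Approximate multiply-adds for one global self-attention layer.

    4*N*C^2 covers the Q, K, V and output projections;
    2*N^2*C covers Q @ K^T and the attention-weighted sum over V.
    """
    return 4 * n_tokens * dim**2 + 2 * n_tokens**2 * dim


def window_attention_cost(n_tokens, dim, window=7):
    """Approximate multiply-adds when attention is limited to local windows.

    Each token attends only to the window*window tokens in its own window,
    so the quadratic term N^2 becomes (window^2) * N, which is linear in N.
    """
    return 4 * n_tokens * dim**2 + 2 * window**2 * n_tokens * dim


if __name__ == "__main__":
    dim = 768  # assumed embedding dimension
    for side in (224, 448, 896):
        n = num_patches(side, side)
        g = global_attention_cost(n, dim)
        w = window_attention_cost(n, dim)
        print(f"{side}x{side}: tokens={n}, global={g:.3e}, windowed={w:.3e}")
```

Each doubling of the image side length multiplies the token count by four. The quadratic term of the global estimate then grows by a factor of sixteen, while the windowed estimate grows only by a factor of four.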
