Computational Complexity of Vision Transformers on High-Resolution Images
The standard Vision Transformer (ViT) architecture is poorly suited to high-resolution images because the cost of its self-attention mechanism grows quadratically with sequence length. For a fixed patch size, the number of flattened patches grows in proportion to the number of pixels, so doubling the image's height and width quadruples the sequence length and multiplies the self-attention cost by roughly sixteen. At high resolutions this cost becomes computationally prohibitive.
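The scaling can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 16-pixel patch size and the 224/448/896 resolutions are assumed values, not taken from the text above. It counts patches and the number of entries in the attention score matrix per head and per layer, ignoring any class token.

```python
def num_patches(height, width, patch_size=16):
    """Sequence length: number of non-overlapping patches (no class token)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_score_entries(height, width, patch_size=16):
    """Entries in the N x N attention score matrix for one head in one layer."""
    n = num_patches(height, width, patch_size)
    return n * n


if __name__ == "__main__":
    # Assumed example resolutions; each step doubles the side length.
    for side in (224, 448, 896):
        n = num_patches(side, side)
        print(f"{side}x{side}: {n} patches, "
              f"{attention_score_entries(side, side)} score entries")
```

Under these assumptions, each doubling of the side length multiplies the patch count by 4 and the score-matrix size by 16, which is the fourth-power growth in side length described above.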
Tags
D2L
Dive into Deep Learning @ D2L
Related
Architectural Adaptation of LLMs for Long Sequences
Quadratic Complexity's Impact on Transformer Inference Speed
Computational Infeasibility of Standard Transformers for Long Sequences
Shared Weight and Shared Activation Methods
Key-Value (KV) Cache in Transformer Inference
Analyzing Model Processing Time
A key component in a modern neural network architecture for processing text has a computational cost that grows quadratically with the length of the input sequence. If processing a sequence of 512 tokens takes 2 seconds on a specific hardware setup, approximately how long would it take to process a sequence of 2048 tokens, assuming all other factors are constant?
Analyzing Computational Scaling