1Cademy - Quadratic Complexity's Impact on Transformer Inference Speed

Approach 1: Processes the input text sequentially, token by token, updating an internal state that is passed from one step to the next.
Approach 2: Processes all input tokens simultaneously, using a mechanism that directly relates every token to every other token in the input to determine context.
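
One way to see the difference between the two approaches is to count the work each one does for the same input. The sketch below is illustrative only: the function names and the toy state update are assumptions made for this example, and neither function is a real model.

```python
def sequential_updates(tokens):
    """Approach 1 (sketch): one state update per token, passed left to right."""
    state = 0
    updates = 0
    for tok in tokens:
        # Toy stand-in for a learned state update (assumption for illustration).
        state = state + len(tok)
        updates += 1
    return updates


def pairwise_interactions(tokens):
    """Approach 2 (sketch): every token is related to every token in the input."""
    interactions = 0
    for _query in tokens:
        for _key in tokens:
            interactions += 1
    return interactions


tokens = ["the", "cat", "sat", "down"]
print(sequential_updates(tokens), pairwise_interactions(tokens))
```

For an input of n tokens, the first count grows as n and the second as n squared. Note that the sequential approach must perform its steps one after another, while the pairwise interactions can be computed in parallel.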

Learn Before

Transformer
Computational Cost of Self-Attention in Transformers

Causation

Quadratic Complexity's Impact on Transformer Inference Speed

Self-attention relates every token to every other token, so its time complexity is quadratic in sequence length: for n tokens it computes on the order of n² pairwise scores, and doubling the sequence length roughly quadruples that work. As a result, Transformer inference becomes progressively slower as sequence length increases. The effect is most pronounced for long sequences, which makes the standard architecture inefficient for such tasks and motivates the development of faster, more efficient models.
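
The quadratic growth can be made concrete with a minimal sketch of single-head self-attention in plain Python. This is a toy under stated assumptions, not a production implementation: queries, keys, and values are all taken to be the raw input vectors (no learned projections), and the function name `toy_self_attention` is hypothetical.

```python
import math


def toy_self_attention(x):
    """Toy self-attention with queries = keys = values = x (assumption).

    x is a list of n vectors, each a list of d floats.
    Returns (outputs, score_count), where score_count is the number of
    query-key scores that were computed.
    """
    n, d = len(x), len(x[0])
    outputs = []
    score_count = 0
    for q in x:
        # One scaled dot-product score per (query, key) pair: n scores per query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        score_count += len(scores)
        # Numerically stable softmax over the scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, x)) / total for j in range(d)]
        outputs.append(out)
    return outputs, score_count


for n in (4, 8, 16):
    x = [[float(i % 3), 1.0] for i in range(n)]
    _, count = toy_self_attention(x)
    print(n, count)
```

Each doubling of n multiplies the number of scores by four (16, 64, 256 for the lengths above), which is the source of the slowdown on long sequences described here.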

Updated 2026-05-02

Contributors are: