Learn Before
Claimed Linear Time Complexity of Self-Attention in Autoregressive Generation
An assertion has been made that the time complexity of self-attention at each step of generating a sequence of current length len, across L layers, is linear in len, specifically $O(L \cdot len)$. This claim is based on the computational cost of the two key products performed at each generation step: the dot product between the new token's query and the cached key vectors ($\mathbf{q}\mathbf{K}^{\top}$), and the product of the Softmax output with the cached value vectors ($\mathrm{Softmax}(\mathbf{q}\mathbf{K}^{\top})\mathbf{V}$). Each of these products touches every one of the len cached positions exactly once, so both cost O(len) per layer (treating the model dimension as a constant).
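A minimal single-head NumPy sketch of one cached decoding step may make the claim concrete. The function name, array shapes, and variable names here are illustrative assumptions, not from the source; the point is that both products iterate over the N cached positions exactly once, which is where the linear scaling comes from:

```python
import numpy as np

def attention_step(q, K_cache, V_cache):
    """One decoding step of single-head self-attention with a KV cache.

    q:        (d,)   query for the token being generated
    K_cache:  (N, d) cached keys for the N positions so far
    V_cache:  (N, d) cached values for those positions

    Both matrix products below touch every cached row once,
    so this step costs O(N * d): linear in the current length N.
    """
    d = q.shape[-1]
    scores = K_cache @ q / np.sqrt(d)        # q . k_i for each cached key: O(N*d)
    weights = np.exp(scores - scores.max())  # numerically stable softmax: O(N)
    weights /= weights.sum()
    return weights @ V_cache                 # weighted sum of values: O(N*d)

# Illustrative usage: doubling N doubles the work of this single step.
d, N = 64, 10
rng = np.random.default_rng(0)
out = attention_step(rng.standard_normal(d),
                     rng.standard_normal((N, d)),
                     rng.standard_normal((N, d)))
```

Repeating this computation once per layer gives the claimed $O(L \cdot len)$ cost for a single generation step.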
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Time Complexity of Self-Attention in Autoregressive Generation
In a model that generates text one token at a time, suppose it has already produced a sequence of length N and is now calculating the next token (at position N+1). Which of the following best identifies the two primary computational operations within the attention mechanism that cause the cost of this single step to scale linearly with the current sequence length N?
Analyzing Generation Latency
Predicting Attention Computation Time