Learn Before
Time Complexity of Self-Attention in Autoregressive Generation
The overall time complexity of the self-attention mechanism when generating a sequence of length len with an L-layer Transformer is O(L · len²). This quadratic complexity arises from summing the linear computational cost of each step, O(i) at generation step i, over all len generation steps: 1 + 2 + ... + len grows as O(len²). The total is then multiplied by L, because this entire process is repeated for each layer in the Transformer stack.
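A minimal counting sketch of this derivation in Python (the function and variable names are hypothetical, not from the course): it tallies one unit of work per query-key interaction and assumes cached keys/values, so step i costs i units per layer.

```python
# Minimal sketch (hypothetical names, not from the course): count query-key
# interactions during autoregressive decoding, assuming cached keys/values
# so that step i only attends from the newest token back over i positions.

def total_attention_ops(seq_len: int, num_layers: int) -> int:
    """Total attention interactions for generating seq_len tokens with num_layers layers."""
    total = 0
    for step in range(1, seq_len + 1):
        total += step * num_layers  # step i is linear in i, repeated in every layer
    return total

# The loop sums 1 + 2 + ... + seq_len = seq_len * (seq_len + 1) / 2 per layer,
# so the total grows as O(num_layers * seq_len^2).
print(total_attention_ops(seq_len=1024, num_layers=4))
print(4 * 1024 * 1025 // 2)  # closed-form check: same number
```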

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Time Complexity of Self-Attention in Autoregressive Generation
Claimed Linear Time Complexity of Self-Attention in Autoregressive Generation
In a model that generates text one token at a time, suppose it has already produced a sequence of length N and is now calculating the next token (at position N+1). Which of the following best identifies the two primary computational operations within the attention mechanism that cause the cost of this single step to scale linearly with the current sequence length N? (A sketch of such a step appears after this list.)
Analyzing Generation Latency
Predicting Attention Computation Time
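A minimal NumPy sketch of one such decoding step (shapes and names are assumptions, not taken from the course): the new token's query is compared against all N cached keys, and the resulting weights are applied to all N cached values, so both operations touch N positions.

```python
# Minimal sketch (assumed shapes/names): one self-attention decoding step
# for the token at position N+1, with keys/values of the N previous tokens cached.
import numpy as np

def decode_step_attention(query, cached_keys, cached_values):
    """query: (d,); cached_keys: (N, d); cached_values: (N, d) -> output: (d,)."""
    d = query.shape[-1]
    scores = cached_keys @ query / np.sqrt(d)   # operation 1: N query-key dot products, O(N*d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over the N scores
    return weights @ cached_values              # operation 2: weighted sum of N value vectors, O(N*d)

# Both operations scale with N, so this single step costs O(N) in the sequence length.
out = decode_step_attention(np.random.randn(64), np.random.randn(128, 64), np.random.randn(128, 64))
print(out.shape)  # (64,)
```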
Learn After
A team is optimizing a text-generation model where the computational cost is dominated by the self-attention mechanism during autoregressive decoding. They need to decide between two potential upgrades:
- Upgrade A: Doubling the number of layers in the model while keeping the maximum sequence length the same.
- Upgrade B: Doubling the maximum sequence length the model can handle while keeping the number of layers the same.
Assuming the model generates a sequence that fills its maximum length capacity in both scenarios, which upgrade would lead to a greater increase in the total computation time, and what is the nature of that increase?
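A rough back-of-envelope comparison, sketched under the cost model derived above (total attention work proportional to L · len²); the specific layer counts and sequence lengths below are hypothetical.

```python
# Back-of-envelope sketch (assumed cost model: attention work ~ num_layers * seq_len**2;
# the layer counts and lengths below are hypothetical examples).

def relative_attention_cost(num_layers: int, seq_len: int) -> int:
    return num_layers * seq_len ** 2

base      = relative_attention_cost(num_layers=24, seq_len=2048)
upgrade_a = relative_attention_cost(num_layers=48, seq_len=2048)  # double the layers
upgrade_b = relative_attention_cost(num_layers=24, seq_len=4096)  # double the max length

print(upgrade_a / base)  # 2.0 -> doubling layers increases cost linearly (2x)
print(upgrade_b / base)  # 4.0 -> doubling sequence length increases cost quadratically (4x)
```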
Derivation of Quadratic Complexity in Autoregressive Attention
Performance Bottleneck in a Generative Model
Vector Products per Self-Attention Step