Learn Before
Time Complexity of Self-Attention in Autoregressive Generation
The overall time complexity of the self-attention mechanism when generating a sequence of length len with an L-layer Transformer is O(L · len²). This quadratic complexity arises from summing the linear computational cost of each step, O(i) at generation step i, over all len generation steps: 1 + 2 + ... + len grows as O(len²). The total is then multiplied by L, because this entire process is repeated for each layer in the Transformer stack.
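A minimal counting sketch of this derivation in Python (the function and variable names are hypothetical, not from the course): it tallies one unit of work per query-key interaction and assumes cached keys/values, so step i costs i units per layer.

```python
# Minimal sketch (hypothetical names, not from the course): count query-key
# interactions during autoregressive decoding, assuming cached keys/values
# so that step i only attends from the newest token back over i positions.

def total_attention_ops(seq_len: int, num_layers: int) -> int:
    """Total attention interactions for generating seq_len tokens with num_layers layers."""
    total = 0
    for step in range(1, seq_len + 1):
        total += step * num_layers  # step i is linear in i, repeated in every layer
    return total

# The loop sums 1 + 2 + ... + seq_len = seq_len * (seq_len + 1) / 2 per layer,
# so the total grows as O(num_layers * seq_len^2).
print(total_attention_ops(seq_len=1024, num_layers=4))
print(4 * 1024 * 1025 // 2)  # closed-form check: same number
```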

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Time Complexity of Self-Attention in Autoregressive Generation
Claimed Linear Time Complexity of Self-Attention in Autoregressive Generation
In a model that generates text one token at a time, suppose it has already produced a sequence of length N and is now calculating the next token (at position N+1). Which of the following best identifies the two primary computational operations within the attention mechanism that cause the cost of this single step to scale linearly with the current sequence length N? (A sketch of such a step appears after this list.)
Analyzing Generation Latency
Predicting Attention Computation Time
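A minimal NumPy sketch of one such decoding step (shapes and names are assumptions, not taken from the course): the new token's query is compared against all N cached keys, and the resulting weights are applied to all N cached values, so both operations touch N positions.

```python
# Minimal sketch (assumed shapes/names): one self-attention decoding step
# for the token at position N+1, with keys/values of the N previous tokens cached.
import numpy as np

def decode_step_attention(query, cached_keys, cached_values):
    """query: (d,); cached_keys: (N, d); cached_values: (N, d) -> output: (d,)."""
    d = query.shape[-1]
    scores = cached_keys @ query / np.sqrt(d)   # operation 1: N query-key dot products, O(N*d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over the N scores
    return weights @ cached_values              # operation 2: weighted sum of N value vectors, O(N*d)

# Both operations scale with N, so this single step costs O(N) in the sequence length.
out = decode_step_attention(np.random.randn(64), np.random.randn(128, 64), np.random.randn(128, 64))
print(out.shape)  # (64,)
```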
Learn After
A team is optimizing a text-generation model where the computational cost is dominated by the self-attention mechanism during autoregressive decoding. They need to decide between two potential upgrades:
- Upgrade A: Doubling the number of layers in the model while keeping the maximum sequence length the same.
- Upgrade B: Doubling the maximum sequence length the model can handle while keeping the number of layers the same.
Assuming the model generates a sequence that fills its maximum length capacity in both scenarios, which upgrade would lead to a greater increase in the total computation time, and what is the nature of that increase?
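A rough back-of-envelope comparison, sketched under the cost model derived above (total attention work proportional to L · len²); the specific layer counts and sequence lengths below are hypothetical.

```python
# Back-of-envelope sketch (assumed cost model: attention work ~ num_layers * seq_len**2;
# the layer counts and lengths below are hypothetical examples).

def relative_attention_cost(num_layers: int, seq_len: int) -> int:
    return num_layers * seq_len ** 2

base      = relative_attention_cost(num_layers=24, seq_len=2048)
upgrade_a = relative_attention_cost(num_layers=48, seq_len=2048)  # double the layers
upgrade_b = relative_attention_cost(num_layers=24, seq_len=4096)  # double the max length

print(upgrade_a / base)  # 2.0 -> doubling layers increases cost linearly (2x)
print(upgrade_b / base)  # 4.0 -> doubling sequence length increases cost quadratically (4x)
```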
Derivation of Quadratic Complexity in Autoregressive Attention
Performance Bottleneck in a Generative Model
Vector Products per Self-Attention Step