Derivation of Quadratic Complexity in Autoregressive Attention
When an autoregressive model generates a sequence of text, the self-attention cost at each decoding step grows linearly with the number of tokens generated so far, yet the total cost of generating the entire sequence grows quadratically with the final sequence length. Explain precisely why this is the case, and how the number of layers in the model affects the total cost.
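A minimal worked sketch of the counting argument, under the usual assumption that previously computed keys and values are cached and reused (a KV cache); the symbols n (generated length), t (decoding step), and L (number of layers) are introduced here for illustration:

```latex
% At decoding step t, the new token's query attends to the t tokens produced
% so far, so the per-layer attention work at that step is proportional to t.
% Summing over all n decoding steps gives the total per-layer attention cost:
\sum_{t=1}^{n} t = \frac{n(n+1)}{2} = O(n^2)
% Each of the L layers repeats this attention work at every step, so the
% total self-attention cost of generating the full sequence is:
L \cdot \frac{n(n+1)}{2} = O(L\,n^2)
```

The layer count therefore scales the total cost linearly, while the generated sequence length enters quadratically.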
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A team is optimizing a text-generation model where the computational cost is dominated by the self-attention mechanism during autoregressive decoding. They need to decide between two potential upgrades:
- Upgrade A: Doubling the number of layers in the model while keeping the maximum sequence length the same.
- Upgrade B: Doubling the maximum sequence length the model can handle while keeping the number of layers the same.
Assuming the model generates a sequence that fills its maximum length capacity in both scenarios, which upgrade would lead to a greater increase in the total computation time, and what is the nature of that increase?
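A minimal numeric sketch of this trade-off, assuming the total decoding-time attention cost scales as L · n(n+1)/2 as derived above; the function name `attention_cost` and the baseline figures (32 layers, a 4096-token maximum length) are illustrative assumptions, not values taken from the question:

```python
# Back-of-the-envelope comparison of the two upgrades, assuming total
# self-attention cost is proportional to L * n * (n + 1) / 2 for L layers
# and sequence length n. The per-token constant factor is omitted because
# it cancels when comparing the upgrades against the same baseline.

def attention_cost(num_layers: int, seq_len: int) -> float:
    """Relative self-attention cost of decoding a full sequence of seq_len tokens."""
    return num_layers * seq_len * (seq_len + 1) / 2

baseline  = attention_cost(num_layers=32, seq_len=4096)   # illustrative baseline
upgrade_a = attention_cost(num_layers=64, seq_len=4096)   # double the layers
upgrade_b = attention_cost(num_layers=32, seq_len=8192)   # double the sequence length

print(f"Upgrade A (2x layers):  {upgrade_a / baseline:.2f}x baseline")  # ~2.00x
print(f"Upgrade B (2x length):  {upgrade_b / baseline:.2f}x baseline")  # ~4.00x
```

Doubling the layer count scales the cost linearly (about 2x), while doubling the sequence length scales it roughly quadratically (about 4x), so Upgrade B produces the larger increase in total computation time.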
Performance Bottleneck in a Generative Model
Vector Products per Self-Attention Step