Vector Products per Self-Attention Step
During a single step of standard autoregressive generation, attending the current position to all $m$ previous context positions requires exactly $2m$ vector products. This total comprises the $m$ dot products needed for the query-key scores ($\mathbf{q}\mathbf{K}^\top$), plus an additional $m$ products to multiply the Softmax-normalized attention scores with the value matrix ($\mathrm{Softmax}\!\left(\frac{\mathbf{q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V}$).
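To make the count concrete, here is a minimal NumPy sketch of one decoding step (single-head attention, no batching; the function name `one_attention_step` and the sizes `m = 8`, `d = 4` are illustrative, not from the original note):

```python
import numpy as np

def one_attention_step(q, K, V):
    """One autoregressive attention step; returns output and vector-product count.

    q: (d,)    query for the current position
    K: (m, d)  keys for the m previous context positions
    V: (m, d)  values for the m previous context positions
    """
    m, d = K.shape
    # m vector products: one dot product q . k_j per context position
    scores = K @ q / np.sqrt(d)          # (m,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # Softmax over the m scores (scalar work only)
    # m more vector products: one scalar-vector product weight_j * v_j, summed
    out = weights @ V                    # (d,)
    return out, 2 * m                    # exactly 2m vector products

# Usage: m = 8 context positions, d = 4 dimensions
rng = np.random.default_rng(0)
m, d = 8, 4
out, n_products = one_attention_step(rng.normal(size=d),
                                     rng.normal(size=(m, d)),
                                     rng.normal(size=(m, d)))
print(n_products)  # 16 == 2 * m
```

The Softmax itself only adds scalar work on the $m$ scores, which is why the count stays at exactly $2m$ vector products.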
Tags
Foundations of Large Language Models
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A team is optimizing a text-generation model where the computational cost is dominated by the self-attention mechanism during autoregressive decoding. They need to decide between two potential upgrades:
- Upgrade A: Doubling the number of layers in the model while keeping the maximum sequence length the same.
- Upgrade B: Doubling the maximum sequence length the model can handle while keeping the number of layers the same.
Assuming the model generates a sequence that fills its maximum length capacity in both scenarios, which upgrade would lead to a greater increase in the total computation time, and what is the nature of that increase?
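A back-of-the-envelope sketch of this trade-off, assuming the per-step cost from the note above ($2m$ vector products at step $m$), that decoding fills the maximum length, and that attention dominates all other costs; the helper name `total_attention_products` and the baseline values (12 layers, length 1024) are illustrative:

```python
def total_attention_products(num_layers, max_len):
    """Vector products to decode a full sequence: 2m per layer at each step m."""
    return num_layers * sum(2 * m for m in range(1, max_len + 1))

base      = total_attention_products(num_layers=12, max_len=1024)
upgrade_a = total_attention_products(num_layers=24, max_len=1024)  # 2x layers
upgrade_b = total_attention_products(num_layers=12, max_len=2048)  # 2x length

print(upgrade_a / base)  # 2.0   -> grows linearly with layer count
print(upgrade_b / base)  # ~4.0  -> grows quadratically with sequence length
```

Under these assumptions, doubling the layer count scales total attention work linearly (2x), while doubling the maximum sequence length scales it roughly quadratically (~4x).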
Derivation of Quadratic Complexity in Autoregressive Attention
Performance Bottleneck in a Generative Model