Performance Bottleneck in a Generative Model
A developer is profiling a text-generation model built on a multi-layer Transformer architecture. During autoregressive generation, the measured time is dominated by the self-attention computations. Based on the profiling data, identify the mathematical relationship between the generated sequence length and the computation time, and explain why this relationship occurs.
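One way to see the relationship is to count the query-key dot products performed during decoding. The sketch below is illustrative, not a real profiler: it assumes a KV cache, so at step t the newly generated token's query attends to all t cached keys.

```python
# Sketch: count query-key dot products during autoregressive decoding
# with a KV cache (illustrative cost model, not a real profiler).

def attention_ops(seq_len: int) -> int:
    """Dot products needed to generate seq_len tokens one at a time."""
    total = 0
    for t in range(1, seq_len + 1):
        # Step t: the new token's query attends to all t cached keys.
        total += t
    return total

for n in (128, 256, 512):
    print(n, attention_ops(n))
```

The total is n(n+1)/2, so doubling the sequence length roughly quadruples the attention work: the relationship between sequence length and computation time is quadratic, O(n^2).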
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A team is optimizing a text-generation model where the computational cost is dominated by the self-attention mechanism during autoregressive decoding. They need to decide between two potential upgrades:
- Upgrade A: Doubling the number of layers in the model while keeping the maximum sequence length the same.
- Upgrade B: Doubling the maximum sequence length the model can handle while keeping the number of layers the same.
Assuming the model generates a sequence that fills its maximum length capacity in both scenarios, which upgrade would lead to a greater increase in the total computation time, and what is the nature of that increase?
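Under the simple cost model where total attention work scales with the number of layers times the square of the sequence length, the two upgrades can be compared directly. The baseline figures below (12 layers, 1024 tokens) are hypothetical, chosen only to make the ratios concrete:

```python
# Simple cost model: total self-attention work ~ num_layers * seq_len^2.
# Baseline numbers are hypothetical, for illustration only.

def attention_cost(num_layers: int, seq_len: int) -> int:
    return num_layers * seq_len ** 2

base      = attention_cost(12, 1024)
upgrade_a = attention_cost(24, 1024)  # double layers: cost scales linearly
upgrade_b = attention_cost(12, 2048)  # double length: cost scales quadratically

print(upgrade_a / base)  # 2.0
print(upgrade_b / base)  # 4.0
```

Doubling the layer count doubles the cost (a linear increase), while doubling the maximum sequence length quadruples it (a quadratic increase), so Upgrade B causes the greater increase in total computation time.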
Derivation of Quadratic Complexity in Autoregressive Attention
Vector Products per Self-Attention Step