Learn Before
An autoregressive model generates a sequence token by token. In a standard implementation, the query vector at position i (q_i) computes attention over the key-value pairs from all preceding positions, from 1 to i. Consider a modified implementation where the query q_i is restricted to attend only to the key-value pairs from the very first position (k_1, v_1) and its own current position (k_i, v_i). How does the computational cost of calculating the attention output for a single query q_i scale as the sequence length i grows very large (e.g., from 100 to 10,000)?
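A minimal NumPy sketch of the two implementations (hypothetical helper names, single query vector, no batching or scaling factor): the standard version computes i dot products and a weighted sum over i values, so its cost grows linearly with i, while the modified version touches only the two key-value pairs (k_1, v_1) and (k_i, v_i), so its cost is constant in i.

```python
import numpy as np

def standard_attention(q_i, K, V):
    # Standard causal attention: q_i attends to all i preceding positions.
    # K, V have shape (i, d), so both the dot products and the weighted
    # sum scale as O(i * d) -- linear in the sequence length i.
    scores = K @ q_i
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def modified_attention(q_i, K, V):
    # Modified attention: q_i attends only to the first position and its
    # own position. Only two rows are used, so the work is O(d) --
    # constant with respect to the sequence length i.
    K2 = K[[0, -1]]          # (k_1, k_i)
    V2 = V[[0, -1]]          # (v_1, v_i)
    scores = K2 @ q_i
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V2
```

For i = 2 the two functions coincide (the only preceding positions are 1 and i); for large i, only the modified version avoids work proportional to i.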
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Trade-offs in Attention Mechanisms
Optimizing Attention for Long-Sequence Processing