Learn Before
Trade-offs in Attention Mechanisms
An engineer proposes a modification to a standard autoregressive language model's attention mechanism. In this new design, for any given token being generated at position i, its query vector q_i will only calculate attention scores with the key-value pairs from the very first position (k_1, v_1) and its own position (k_i, v_i), instead of all preceding positions. Evaluate the primary advantage and the most significant potential disadvantage of this proposed modification.
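Before evaluating the trade-off, it can help to see the proposed attention pattern concretely. The sketch below is a minimal, illustrative NumPy implementation of the modification described above (the function name and shapes are assumptions for illustration, not the engineer's actual code): each query attends only to the key-value pairs at position 1 (index 0) and its own position.

```python
import numpy as np

def restricted_attention(Q, K, V):
    """Attention where token i attends only to position 1 (index 0) and itself.

    Q, K, V: arrays of shape (seq_len, d). Illustrative sketch of the
    proposed modification, not a production implementation.
    """
    seq_len, d = Q.shape
    out = np.empty_like(V)
    for i in range(seq_len):
        # Keys/values visible to position i: the first token and token i itself.
        idx = [0, i] if i > 0 else [0]
        scores = Q[i] @ K[idx].T / np.sqrt(d)   # at most 2 scores per query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                # softmax over the visible pairs
        out[i] = weights @ V[idx]               # weighted sum of visible values
    return out
```

Note that the inner loop touches at most two key-value pairs regardless of `i`, which is the source of the advantage (constant per-query cost) and of the disadvantage (no access to any context between position 1 and position i).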
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An autoregressive model generates a sequence token by token. In a standard implementation, the query vector at position i (q_i) computes attention over the key-value pairs from all preceding positions, from 1 to i. Consider a modified implementation where the query q_i is restricted to attend only to the key-value pairs from the very first position (k_1, v_1) and its own current position (k_i, v_i). How does the computational cost of calculating the attention output for a single query q_i scale as the sequence length i grows very large (e.g., from 100 to 10,000)?
Trade-offs in Attention Mechanisms
Optimizing Attention for Long-Sequence Processing
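The scaling question in the related item above comes down to counting how many key-value pairs each query touches. A minimal sketch of that count (the function name is illustrative, not from the source): standard causal attention touches i pairs per query, while the restricted variant touches a constant number.

```python
def attended_positions(i: int, restricted: bool) -> int:
    """Number of (k, v) pairs the query q_i touches (positions are 1-indexed).

    Standard causal attention: q_i attends to all positions 1..i, so the
    per-query cost grows linearly with i. Restricted variant: q_i attends
    only to position 1 and position i, so the count is constant.
    """
    if restricted:
        return 1 if i == 1 else 2
    return i

# Standard attention scales linearly with sequence length;
# the restricted variant stays O(1) per query.
print(attended_positions(100, False), attended_positions(10_000, False))
print(attended_positions(100, True), attended_positions(10_000, True))
```

Growing the sequence from 100 to 10,000 multiplies the standard per-query cost by 100, while the restricted variant's cost does not change at all.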