Learn Before
Debugging a Causal Attention Implementation
An engineer is implementing an autoregressive attention mechanism for a sequence of 10 tokens (indexed 0 to 9). When computing the attention output for the token at position i=4, they construct a key matrix containing the key vectors from all 10 positions (k_0 through k_9). Explain why this is incorrect for an autoregressive (causal) mechanism. What is the correct composition of the key matrix for position i=4?
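For intuition, here is a minimal NumPy-style sketch of the fix (the 10x2 key shape, the random values, and the causal_keys helper are illustrative assumptions, not part of the exercise): under causal masking, the query at position i may only interact with keys k_0 through k_i, so the key matrix for i=4 should contain exactly 5 rows, not 10.

```python
import numpy as np

def causal_keys(K, i):
    # Keys a causal (autoregressive) query at position i may attend to:
    # rows k_0 through k_i only -- nothing from future positions.
    return K[: i + 1]

# Illustrative setup: 10 tokens with 2-dimensional keys (shape is an assumption).
rng = np.random.default_rng(0)
K = rng.standard_normal((10, 2))   # rows are k_0 ... k_9

K_buggy = K                    # shape (10, 2): leaks future keys k_5 ... k_9
K_correct = causal_keys(K, 4)  # shape (5, 2): k_0 ... k_4, as causality requires
print(K_buggy.shape, K_correct.shape)  # (10, 2) (5, 2)
```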
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Causal Attention Input Structure
Enumeration of Dot Products in Causal Self-Attention
State Variables in Linear Attention (μ_i, ν_i)
In an autoregressive attention mechanism, a sequence of key vectors is generated. Given the first three key vectors k_0 = [1, 2], k_1 = [3, 4], and k_2 = [5, 6], which of the following matrices represents the complete set of keys that the query at position i=2 is allowed to interact with?
Debugging a Causal Attention Implementation
In an autoregressive attention mechanism processing a sequence of 10 tokens (indexed 0 to 9), the matrix of key vectors used to compute the output for the token at position 3 is identical to the matrix of key vectors used for the token at position 7.