Learn Before
State Variables in Linear Attention (μ_i, ν_i)
In certain linear attention variants, the entire history of key-value pairs up to position i is summarized by two state variables: μ_i and ν_i. The state μ_i is the cumulative sum of outer products between transformed key vectors and their corresponding value vectors (μ_i = Σ_{j=0 to i} k'_jᵀ v_j). The state ν_i is the cumulative sum of the transformed key vectors (ν_i = Σ_{j=0 to i} k'_jᵀ). These states allow the attention mechanism to operate without re-accessing the full history at each step.
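As a rough illustration of the two definitions, the sketch below accumulates μ_i and ν_i one position at a time in plain Python. The helper names (`outer`, `update_state`, `run_states`) are hypothetical and chosen only for this example, and the toy vectors are the ones used in the exercise under "Learn After"; treat it as a minimal sketch rather than a reference implementation.

```python
def outer(k, v):
    """Outer product k^T v of a row vector k (length d_k) and a row vector v (length d_v)."""
    return [[k_a * v_b for v_b in v] for k_a in k]


def update_state(mu, nu, k, v):
    """Fold one more (transformed key, value) pair into the running states."""
    kv = outer(k, v)
    new_mu = [[m + x for m, x in zip(mu_row, kv_row)]
              for mu_row, kv_row in zip(mu, kv)]
    new_nu = [n + k_a for n, k_a in zip(nu, k)]
    return new_mu, new_nu


def run_states(keys, values):
    """Return (mu_i, nu_i) after processing all given positions in order."""
    d_k, d_v = len(keys[0]), len(values[0])
    mu = [[0] * d_v for _ in range(d_k)]  # d_k x d_v matrix of zeros
    nu = [0] * d_k                        # length-d_k vector of zeros
    for k, v in zip(keys, values):
        mu, nu = update_state(mu, nu, k, v)
    return mu, nu


# Toy transformed keys k'_0..k'_2 and values v_0..v_2 (illustrative data).
keys = [[1, 0], [0, 2], [1, 1]]
values = [[3, 4], [5, 6], [7, 8]]

mu_2, nu_2 = run_states(keys, values)
print(mu_2)  # [[10, 12], [17, 20]]
print(nu_2)  # [2, 3]
```

Each call to `update_state` touches only the current key and value, which is why the earlier pairs never need to be revisited.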

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Causal Attention Input Structure
Enumeration of Dot Products in Causal Self-Attention
State Variables in Linear Attention (μ_i, ν_i)
In an autoregressive attention mechanism, a sequence of key vectors is generated. Given the first three key vectors k_0 = [1, 2], k_1 = [3, 4], and k_2 = [5, 6], which of the following matrices represents the complete set of keys that the query at position i=2 is allowed to interact with?
Debugging a Causal Attention Implementation
In an autoregressive attention mechanism processing a sequence of 10 tokens (indexed 0 to 9), the matrix of key vectors used to compute the output for the token at position 3 is identical to the matrix of key vectors used for the token at position 7.
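For the causal-attention items above, a small sketch may help make the key set concrete. It assumes the usual convention that the query at position i may attend to keys k_0 through k_i inclusive; the function name `visible_keys` is hypothetical.

```python
def visible_keys(keys, i):
    """Keys available to the query at position i under causal masking: k_0..k_i."""
    return keys[: i + 1]


keys = [[1, 2], [3, 4], [5, 6]]
print(visible_keys(keys, 2))  # [[1, 2], [3, 4], [5, 6]]
print(visible_keys(keys, 1))  # [[1, 2], [3, 4]]

# With 10 tokens (indices 0..9), positions 3 and 7 see different numbers of keys.
ten_keys = [[j, j] for j in range(10)]  # placeholder key vectors
print(len(visible_keys(ten_keys, 3)), len(visible_keys(ten_keys, 7)))  # 4 8
```

Under this convention the key matrix grows by one row per position, so two different positions never share the same key matrix.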
Learn After
In a simplified attention mechanism, the history of key-value pairs up to a position i is summarized by two state variables: μ_i, which is the cumulative sum of outer products between transformed key vectors and their corresponding value vectors (Σ k'_jᵀ v_j), and ν_i, which is the cumulative sum of the transformed key vectors (Σ k'_jᵀ). Given the following sequence of 2-dimensional vectors up to position i=2:
k'_0 = [1, 0], v_0 = [3, 4]
k'_1 = [0, 2], v_1 = [5, 6]
k'_2 = [1, 1], v_2 = [7, 8]
Calculate the state variables μ_2 and ν_2.
In a specific type of attention mechanism, the history of key-value pairs up to a position i is summarized by two state variables: a matrix μ_i and a vector ν_i. They are defined as cumulative sums:
μ_i = Σ_{j=0 to i} (k'_jᵀ * v_j) (sum of outer products)
ν_i = Σ_{j=0 to i} (k'_jᵀ) (sum of transformed key vectors)
Suppose you have already computed the state variables μ_i and ν_i for a sequence up to position i. To compute the next state variables, μ_{i+1} and ν_{i+1}, what is the only additional information you need?
Computational Advantage of State Variables
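The computational advantage can be read off the definitions directly. Splitting the last term out of each cumulative sum gives a one-step recurrence, written here in LaTeX with the same symbols as above (the zero initial states are an assumption made for completeness):

```latex
\begin{aligned}
\mu_{i+1} &= \sum_{j=0}^{i+1} k'^{\top}_{j} v_{j}
           = \mu_i + k'^{\top}_{i+1} v_{i+1}, \\
\nu_{i+1} &= \sum_{j=0}^{i+1} k'^{\top}_{j}
           = \nu_i + k'^{\top}_{i+1}, \\
\mu_{-1} &= 0, \qquad \nu_{-1} = 0 .
\end{aligned}
```

Each update uses only the previous states and the newest transformed key and value, so the cost per position does not depend on how many positions came before.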