1Cademy - State Variables in Linear Attention (μ_i, ν

Learn Before

Key Matrix for Causal Attention (K_≤i)

Definition

State Variables in Linear Attention (μ_i, ν_i)

In certain linear attention variants, the entire history of key-value pairs up to a position $i$ is summarized by two state variables: $\mu_i$ and $\nu_i$. The state $\mu_i$ is a matrix, the cumulative sum of outer products between the transformed key vectors and their corresponding value vectors ($\mu_i = \sum_{j=0}^{i} \mathbf{k'}_j^T \mathbf{v}_j$). The state $\nu_i$ is a vector, the cumulative sum of the transformed key vectors ($\nu_i = \sum_{j=0}^{i} \mathbf{k'}_j^T$). Because both are running sums, they can be carried forward from one position to the next, which lets the attention mechanism operate without re-accessing the full history at each step.
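The definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `update_state`, the array names, and the numeric values are all hypothetical, and each vector is assumed to be a row so that $\mathbf{k'}_j^T \mathbf{v}_j$ is an outer product. The loop folds one key-value pair at a time into the running sums and is then compared against a direct evaluation of the two sums over the full history.

```python
import numpy as np

def update_state(mu, nu, k_prime, v):
    """Fold one (k', v) pair into the running sums (hypothetical helper)."""
    mu = mu + np.outer(k_prime, v)  # k'^T v: a (d_k, d_v) outer product
    nu = nu + k_prime               # running sum of transformed keys
    return mu, nu

# Made-up transformed keys and values: 3 positions, d_k = d_v = 2.
K_prime = np.array([[1.0, 0.0], [0.5, 2.0], [1.0, 1.0]])
V = np.array([[2.0, 1.0], [0.0, 3.0], [4.0, 1.0]])

# Incremental computation: only the current pair is touched at each step.
mu = np.zeros((2, 2))
nu = np.zeros(2)
for k_prime, v in zip(K_prime, V):
    mu, nu = update_state(mu, nu, k_prime, v)

# Direct evaluation of the definitions over the full history, for comparison.
mu_direct = K_prime.T @ V        # equals the sum over j of k'_j^T v_j
nu_direct = K_prime.sum(axis=0)  # equals the sum over j of k'_j

print(np.allclose(mu, mu_direct), np.allclose(nu, nu_direct))
```

The per-step work in the loop depends only on the dimensions of the current key and value, not on how many positions came before, which is the sense in which the history does not need to be re-accessed.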

Updated 2026-06-28

References

Reference of Foundations of Large Language Models Course

Learn After

In a simplified attention mechanism, the history of key-value pairs up to a position i is summarized by two state variables: μ_i, which is the cumulative sum of outer products between transformed key vectors and their corresponding value vectors (Σ k'_jᵀ v_j), and ν_i, which is the cumulative sum of the transformed key vectors (Σ k'_jᵀ).

Given the following sequence of 2-dimensional vectors up to position i=2:

k'_0 = [1, 0], v_0 = [3, 4]
k'_1 = [0, 2], v_1 = [5, 6]
k'_2 = [1, 1], v_2 =
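
As a partial check, the states can be worked out by hand for the pairs that are fully given. Only positions 0 and 1 are used here because the value at position 2 is cut off; each vector is assumed to be a row, so $\mathbf{k'}_j^T \mathbf{v}_j$ is a 2×2 outer product.

$$
\mu_1 = \mathbf{k'}_0^T \mathbf{v}_0 + \mathbf{k'}_1^T \mathbf{v}_1
= \begin{bmatrix} 3 & 4 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 10 & 12 \end{bmatrix}
= \begin{bmatrix} 3 & 4 \\ 10 & 12 \end{bmatrix}
$$

$$
\nu_1 = \mathbf{k'}_0^T + \mathbf{k'}_1^T
= \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 2 \end{bmatrix}
= \begin{bmatrix} 1 \\ 2 \end{bmatrix}
$$
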
In a specific type of attention mechanism, the history of key-value pairs up to a position i is summarized by two state variables: a matrix μ_i and a vector ν_i. They are defined as cumulative sums:

μ_i = Σ_{j=0 to i} (k'_jᵀ * v_j) (sum of outer products)
ν_i = Σ_{j=0 to i} (k'_jᵀ) (sum of transformed key vectors)

Suppose you have already computed the state variables μ_i and ν_i for a sequence up to position i. To compute the next state variables, μ_{i+1} and ν_{i+1}, what
Computational Advantage of State Variables
