Learn Before
Linear Attention Output Calculation
In this variant of linear attention, the final output is calculated by combining the current transformed query vector q'_i with the accumulated state variables μ_i and ν_i. The numerator is the product of the query and the key-value state μ_i, while the denominator is the product of the query and the key state ν_i, serving as a normalization term. The formula is: output_i = (q'_i μ_i) / (q'_i ν_i). This approach replaces the standard Softmax operation with simpler matrix-vector products, leading to computational savings.
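The recurrent computation described above can be sketched in plain Python. This is a minimal illustration, not an authoritative implementation: the function names (`update_state`, `attention_output`), the toy dimensions, and the use of nested lists instead of a tensor library are all assumptions for clarity.

```python
def update_state(mu, nu, k_prime, v):
    # One recurrent step:
    #   mu_i = mu_{i-1} + k'_i^T v_i   (outer product of key and value)
    #   nu_i = nu_{i-1} + k'_i         (accumulated key state)
    d, dv = len(k_prime), len(v)
    new_mu = [[mu[a][b] + k_prime[a] * v[b] for b in range(dv)] for a in range(d)]
    new_nu = [nu[a] + k_prime[a] for a in range(d)]
    return new_mu, new_nu

def attention_output(q_prime, mu, nu):
    # output_i = (q'_i mu_i) / (q'_i nu_i)
    # Numerator: query times key-value state; denominator: query times
    # key state, acting as the normalization term.
    d, dv = len(q_prime), len(mu[0])
    numer = [sum(q_prime[a] * mu[a][b] for a in range(d)) for b in range(dv)]
    denom = sum(q_prime[a] * nu[a] for a in range(d))
    return [x / denom for x in numer]

# Toy example with 2-dimensional keys/queries and values.
mu = [[0.0, 0.0], [0.0, 0.0]]
nu = [0.0, 0.0]
mu, nu = update_state(mu, nu, k_prime=[1.0, 1.0], v=[2.0, 4.0])
out = attention_output(q_prime=[1.0, 0.0], mu=mu, nu=nu)
```

No Softmax is evaluated anywhere: each step is just accumulation plus matrix-vector products, which is where the computational savings come from.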

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Computational and Memory Efficiency of Linear Attention's Recurrent Method
A sequential model updates two history-representing variables, μ and ν, at each step i using the following rules: μ_i = μ_{i-1} + k'_i^T v_i and ν_i = ν_{i-1} + k'_i^T. Consider the update at a single step i. If the input value vector v_i is a zero vector (a vector of all zeros), but the input key vector k'_i is a non-zero vector, what is the outcome of the update from step i-1 to step i?
Recurrent State Update Calculation
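The zero-value-vector case in the question above can be checked numerically. This is a hedged sketch assuming the update rules μ_i = μ_{i-1} + k'_i^T v_i and ν_i = ν_{i-1} + k'_i; the helper name `update_state` and the toy values are illustrative, not from the source.

```python
def update_state(mu, nu, k_prime, v):
    # mu_i = mu_{i-1} + k'_i^T v_i ; nu_i = nu_{i-1} + k'_i
    d, dv = len(k_prime), len(v)
    new_mu = [[mu[a][b] + k_prime[a] * v[b] for b in range(dv)] for a in range(d)]
    new_nu = [nu[a] + k_prime[a] for a in range(d)]
    return new_mu, new_nu

mu0 = [[0.0, 0.0], [0.0, 0.0]]
nu0 = [0.0, 0.0]
k = [1.0, 2.0]        # non-zero key vector
v_zero = [0.0, 0.0]   # zero value vector

mu1, nu1 = update_state(mu0, nu0, k, v_zero)
# mu is unchanged (k'^T v is the zero matrix), but nu still accumulates k'.
```

So a zero value vector leaves the key-value state μ untouched, while the key state ν is still updated by the non-zero key.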
Unrolling a Recurrent State Update
Linear Attention Output Calculation
Learn After
In the formula for calculating a linear attention output, Output = (q'_i * μ_i) / (q'_i * ν_i), where q'_i is the transformed query, μ_i is the accumulated key-value state, and ν_i is the accumulated key state, what is the primary role of the denominator term q'_i * ν_i?
Calculating a Linear Attention Output Vector