Learn Before
Computational Advantage of State Variables
In a specific variant of an attention mechanism, the entire history of key-value pairs up to a position i is summarized by two cumulative state variables: a matrix μ_i (sum of outer products of keys and values) and a vector ν_i (sum of keys). This allows the calculation for the current step to be performed using only the state from the previous step and the current key-value pair, without re-accessing the full history.
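The incremental update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`outer`, `update_state`, `state_from_history`) and the toy key and value vectors are assumptions chosen for the example, and keys are treated as already transformed.

```python
def outer(k, v):
    """Outer product of key k (length d_k) and value v (length d_v)."""
    return [[ki * vj for vj in v] for ki in k]


def update_state(mu, nu, k, v):
    """Fold one key-value pair into the running state (mu, nu).

    Only the previous state and the current pair are used;
    the earlier history is never touched.
    """
    kv = outer(k, v)
    new_mu = [[m + x for m, x in zip(m_row, x_row)] for m_row, x_row in zip(mu, kv)]
    new_nu = [n + ki for n, ki in zip(nu, k)]
    return new_mu, new_nu


def state_from_history(keys, values):
    """Reference computation: re-scan the full history and sum directly."""
    d_k, d_v = len(keys[0]), len(values[0])
    mu = [
        [sum(k[a] * v[b] for k, v in zip(keys, values)) for b in range(d_v)]
        for a in range(d_k)
    ]
    nu = [sum(k[a] for k in keys) for a in range(d_k)]
    return mu, nu


# Hypothetical toy data (not taken from the text): three 2-dimensional pairs.
keys = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
values = [[1.0, 0.0], [2.0, 2.0], [0.0, 4.0]]

# Start from an all-zero state and apply one update per position.
mu = [[0.0, 0.0], [0.0, 0.0]]
nu = [0.0, 0.0]
for k, v in zip(keys, values):
    mu, nu = update_state(mu, nu, k, v)

print(mu)  # [[1.0, 12.0], [4.0, 6.0]]
print(nu)  # [4.0, 4.0]
```

Each call to `update_state` costs work proportional to d_k × d_v regardless of the position i, whereas `state_from_history` does work proportional to i × d_k × d_v every time it is called. The gap therefore widens as the sequence grows.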
Analyze the following two scenarios and determine in which one this summarization method offers a more significant computational advantage compared to a standard attention mechanism that re-scans the entire history at each step. Justify your reasoning.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a simplified attention mechanism, the history of key-value pairs up to a position i is summarized by two state variables: μ_i, which is the cumulative sum of outer products between transformed key vectors and their corresponding value vectors (Σ k'_jᵀ v_j), and ν_i, which is the cumulative sum of the transformed key vectors (Σ k'_jᵀ). Given the following sequence of 2-dimensional vectors up to position i = 2:
k'_0 = [1, 0], v_0 = [3, 4]
k'_1 = [0, 2], v_1 = [5, 6]
k'_2 = [1, 1], v_2 = [7, 8]
Calculate the state variables μ_2 and ν_2.

In a specific type of attention mechanism, the history of key-value pairs up to a position i is summarized by two state variables: a matrix μ_i and a vector ν_i. They are defined as cumulative sums:
μ_i = Σ_{j=0 to i} (k'_jᵀ * v_j) (sum of outer products)
ν_i = Σ_{j=0 to i} (k'_jᵀ) (sum of transformed key vectors)
Suppose you have already computed the state variables μ_i and ν_i for a sequence up to position i. To compute the next state variables, μ_{i+1} and ν_{i+1}, what is the only additional information you need?

Computational Advantage of State Variables