Learn Before
Sparse Attention Output Formula
For language models employing sparse attention, the output for a query token at position i is calculated as a weighted sum of value vectors. This computation is restricted to a specific subset of indices, G, where the attention weights are considered non-zero. The formula for the sparse attention output is:
output_i = Σ_{j ∈ G} α_{i,j} · v_j
Here, k_0, …, k_i and v_0, …, v_i represent the keys and values up to position i. The summation iterates only over the indices in the sparse set G, applying the non-zero sparse attention weights α_{i,j} to their corresponding value vectors v_j.
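As a minimal sketch, the formula above can be written in plain Python, assuming the non-zero attention weights arrive as a dict keyed by index (the function name and interface are illustrative, not part of the card):

```python
def sparse_attention_output(alpha, values, G):
    """Weighted sum of value vectors over the sparse index set G.

    alpha:  dict mapping index j -> attention weight α_{i,j}
    values: list of value vectors v_0 .. v_i
    G:      set of indices whose attention weights are non-zero
    """
    dim = len(values[0])
    out = [0.0] * dim
    # Iterate only over the sparse set G; all other positions contribute nothing.
    for j in G:
        for d in range(dim):
            out[d] += alpha[j] * values[j][d]
    return out
```

Because the loop visits only the indices in G, value vectors outside the sparse set never enter the sum, which is exactly what distinguishes this from dense attention.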

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Sparse Attention Output Formula
A causal model is calculating the output for the token at position i = 3. The model's attention mechanism is optimized to only consider a subset of previous positions. The set of contributing indices is G = {0, 2}. The attention weights for these indices are α_3,0 = 0.6 and α_3,2 = 0.4. The value vectors for the relevant positions are: v_0 = [1, 0], v_1 = [2, 2], and v_2 = [0, 3]. Based on this information, what is the final output vector for position 3?
Evaluating Vector Contributions in an Optimized Attention Mechanism
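The arithmetic for this card can be checked with a short Python sketch (variable names are illustrative, not from the card): only the indices in G = {0, 2} contribute, so v_1 is skipped entirely.

```python
# Sparse attention output for i = 3: only indices in G = {0, 2} contribute.
alpha = {0: 0.6, 2: 0.4}                    # α_3,0 and α_3,2
v = {0: [1, 0], 1: [2, 2], 2: [0, 3]}       # v_1 is never used (1 ∉ G)

output = [
    sum(alpha[j] * v[j][d] for j in alpha)  # weighted sum per dimension
    for d in range(2)
]
# 0.6·[1, 0] + 0.4·[0, 3] = [0.6, 1.2] (up to float rounding)
```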
Selective Computation in Optimized Attention
Index Set of Non-Zero Attention Weights (G)
Learn After
Comparison of Sparse and Dense Attention Weights
A language model is calculating an output vector using a sparse attention mechanism. The computation for the current token only considers a subset of previous tokens, identified by the index set G = {0, 2, 3}. Given the value vectors and corresponding attention weights below, what is the correct output vector?
Value Vectors:
- v_0 = [2, 1]
- v_1 = [4, 5]
- v_2 = [6, 0]
- v_3 = [1, 3]
Attention Weights for the included set G:
- α'_0 = 0.5
- α'_2 = 0.2
- α'_3 = 0.3
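As with the previous card, the answer follows directly from the sparse sum: only indices 0, 2, and 3 contribute, and v_1 is excluded. A short Python sketch (variable names are illustrative):

```python
# G = {0, 2, 3}: v_1 is excluded, so it contributes nothing to the output.
alpha = {0: 0.5, 2: 0.2, 3: 0.3}            # α'_0, α'_2, α'_3
v = [[2, 1], [4, 5], [6, 0], [1, 3]]        # v_0 .. v_3

output = [sum(alpha[j] * v[j][d] for j in alpha) for d in range(2)]
# 0.5·[2, 1] + 0.2·[6, 0] + 0.3·[1, 3] = [2.5, 1.4] (up to float rounding)
```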
Analysis of Sparse Attention Formula Components
Analyzing the Impact of the Sparse Index Set