Learn Before
Comparison of Sparse and Dense Attention Weights
In sparse attention, the attention weights are normalized exclusively over the subset of indices defined by the set G. This redistribution of probability mass means that the weight that would have been assigned to the ignored tokens is now allocated among the tokens in G. As a result, the sparse attention weight α'_i for any token i in G is greater than its counterpart α_i in a standard dense attention calculation: α'_i > α_i for all i in G.
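A minimal sketch of this effect, using made-up raw attention scores (the score values and the index set G below are illustrative assumptions, not from the text):

```python
import math

def softmax(scores):
    """Normalize a list of raw scores into attention weights."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw attention scores for four previous tokens (made-up values)
scores = [1.0, 0.5, 2.0, 1.5]

# Dense attention: normalize over all tokens
dense = softmax(scores)

# Sparse attention: normalize only over the index set G
G = [0, 2, 3]
sparse = dict(zip(G, softmax([scores[i] for i in G])))

# Every retained token's sparse weight exceeds its dense counterpart
for i in G:
    assert sparse[i] > dense[i]
```

Because the denominator of the softmax shrinks when scores are dropped while each retained numerator stays the same, every surviving weight can only grow.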

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Comparison of Sparse and Dense Attention Weights
A language model is calculating an output vector using a sparse attention mechanism. The computation for the current token only considers a subset of previous tokens, identified by the index set G = {0, 2, 3}. Given the value vectors and corresponding attention weights below, what is the correct output vector?
Value Vectors:
- v_0 = [2, 1]
- v_1 = [4, 5]
- v_2 = [6, 0]
- v_3 = [1, 3]
Attention Weights for the included set G:
- α'_0 = 0.5
- α'_2 = 0.2
- α'_3 = 0.3
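The computation asked for above is a weighted sum of the value vectors over the included set G; a short sketch using the given numbers:

```python
# Value vectors and sparse attention weights from the question above
values = {0: [2, 1], 1: [4, 5], 2: [6, 0], 3: [1, 3]}
weights = {0: 0.5, 2: 0.2, 3: 0.3}  # only indices in G = {0, 2, 3}

# Output vector: sum over i in G of α'_i * v_i, per dimension
output = [
    sum(weights[i] * values[i][d] for i in weights)
    for d in range(2)
]
# output ≈ [2.5, 1.4] (up to float rounding); v_1 is ignored entirely
```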
Analysis of Sparse Attention Formula Components
Analyzing the Impact of the Sparse Index Set
Learn After
An attention mechanism calculates normalized weights for a query token against four previous tokens, resulting in weights α₁, α₂, α₃, and α₄. Now, a new version of the mechanism is implemented which is constrained to only consider the second and fourth tokens. The attention scores are re-calculated and re-normalized over just this smaller set, resulting in new weights α'₂ and α'₄. Assuming all original weights were greater than zero, what is the relationship between the new weight for the second token, α'₂, and its original weight, α₂?
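A quick numeric sketch of the relationship being asked about, using hypothetical raw scores (the score values are made up for illustration):

```python
import math

def softmax(scores):
    """Normalize raw scores into attention weights."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the four tokens (made-up values)
scores = [0.8, 1.2, 0.4, 2.0]
alpha = softmax(scores)  # original weights α₁..α₄, all > 0

# Constrained mechanism: re-normalize over tokens 2 and 4 only
alpha_p2, alpha_p4 = softmax([scores[1], scores[3]])

# Each surviving weight grows: α'₂ > α₂ and α'₄ > α₄
assert alpha_p2 > alpha[1] and alpha_p4 > alpha[3]

# Equivalently, α'₂ is just α₂ renormalized over the kept tokens:
# α'₂ = α₂ / (α₂ + α₄), which exceeds α₂ whenever α₂ + α₄ < 1
assert abs(alpha_p2 - alpha[1] / (alpha[1] + alpha[3])) < 1e-12
```

The renormalization identity makes the answer independent of the particular scores: as long as some positive weight was dropped, the kept weights must increase.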
Effect of Sparsity on Attention Weights
Analyzing Attention Weight Redistribution