Learn Before
Effect of Sparsity on Attention Weights
An attention mechanism calculates the following dense attention weights for a specific query token against four preceding key tokens:
- Token 1: 0.10
- Token 2: 0.40
- Token 3: 0.20
- Token 4: 0.30
The mechanism is then modified to be sparse, considering only Token 2 and Token 4. The attention scores are re-calculated and re-normalized over just this smaller set. Explain the fundamental reason why the new, sparse attention weight for Token 2 will be greater than its original dense weight of 0.40.
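A minimal numerical sketch of the idea, assuming the sparse step keeps the same underlying scores so that the softmax over the retained subset equals the dense weights renormalized over that subset (the numpy usage here is purely illustrative):

```python
import numpy as np

# Dense weights from the question, one per key token; they sum to 1.0.
dense = np.array([0.10, 0.40, 0.20, 0.30])

# The sparse variant retains only Token 2 and Token 4 (indices 1 and 3).
kept = [1, 3]

# Softmax restricted to a subset of the same logits equals the dense
# weights on that subset divided by their total mass, so renormalize directly.
sparse = dense[kept] / dense[kept].sum()

print(sparse)  # [0.5714 0.4286]: Token 2 rises from 0.40 to about 0.57
```

Because the retained mass 0.40 + 0.30 = 0.70 is strictly less than 1, dividing by it inflates every surviving weight; the probability mass given up by Tokens 1 and 3 is redistributed proportionally over the retained pair, which is what pushes Token 2 above 0.40.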
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An attention mechanism calculates normalized weights for a query token against four previous tokens, resulting in weights α₁, α₂, α₃, and α₄. A sparse version of the mechanism is then implemented that is constrained to consider only the second and fourth tokens. The attention scores are re-calculated and re-normalized over just this smaller set, resulting in new weights α'₂ and α'₄. Assuming all original weights were greater than zero, what is the relationship between the new weight for the second token, α'₂, and its original weight, α₂? (A short derivation is sketched after this list.)
Effect of Sparsity on Attention Weights
Analyzing Attention Weight Redistribution
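For the symbolic variant in the related item above, a short worked derivation, assuming as in the numerical sketch that the scores are unchanged and only the normalization set shrinks:

```latex
\[
\alpha'_2 = \frac{\alpha_2}{\alpha_2 + \alpha_4},
\qquad
0 < \alpha_2 + \alpha_4 = 1 - \alpha_1 - \alpha_3 < 1
\;\Longrightarrow\;
\alpha'_2 > \alpha_2 .
\]
```

The same argument gives α'₄ > α₄: since all original weights are positive, the denominator is strictly less than 1, so every retained weight grows.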