Learn Before
An attention mechanism calculates normalized weights for a query token against four previous tokens, resulting in weights α₁, α₂, α₃, and α₄. Now, a new version of the mechanism is implemented which is constrained to only consider the second and fourth tokens. The attention scores are re-calculated and re-normalized over just this smaller set, resulting in new weights α'₂ and α'₄. Assuming all original weights were greater than zero, what is the relationship between the new weight for the second token, α'₂, and its original weight, α₂?
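The relationship can be checked numerically: since the new softmax is taken over only tokens 2 and 4, the surviving weights are rescaled as α'₂ = α₂ / (α₂ + α₄), which strictly exceeds α₂ whenever α₁ + α₃ > 0. A minimal sketch (the raw scores below are arbitrary illustrative values, not from the original question):

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for tokens 1..4 (any values work).
scores = [1.0, 2.0, 0.5, 1.5]
alpha = softmax(scores)                         # full weights α₁..α₄

# Restrict attention to tokens 2 and 4 and re-normalize over that subset.
alpha_sparse = softmax([scores[1], scores[3]])  # new weights α'₂, α'₄

# Re-normalizing over the subset is equivalent to rescaling the survivors:
# α'₂ = α₂ / (α₂ + α₄) > α₂, because α₂ + α₄ < 1 when α₁, α₃ > 0.
assert abs(alpha_sparse[0] - alpha[1] / (alpha[1] + alpha[3])) < 1e-12
assert alpha_sparse[0] > alpha[1]
```

The same argument applies to α'₄, so every weight that survives the sparsification grows.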
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Effect of Sparsity on Attention Weights
Analyzing Attention Weight Redistribution