Learn Before
Analyzing Attention Weight Redistribution
An engineer is analyzing a language model's attention mechanism. For a specific query token, the model initially calculates the following normalized attention weights (dense attention) over four previous tokens:
- Token 1: 0.1
- Token 2: 0.5
- Token 3: 0.3
- Token 4: 0.1
To improve efficiency, the engineer modifies the mechanism to attend only to Token 2 and Token 3, ignoring Tokens 1 and 4. The attention scores for Tokens 2 and 3 are re-calculated and re-normalized over this smaller set.
Predict the effect of this change on the attention weight for Token 2. Will the new, sparse attention weight for Token 2 be greater than, less than, or equal to its original weight of 0.5? Justify your answer based on the principle of probability mass redistribution.
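The re-normalization described in the question can be sketched numerically. This is a hedged illustration, assuming the raw scores are unchanged, so that the restricted softmax reduces to re-normalizing the original weights proportionally (the variable names are illustrative, not part of the question):

```python
# Dense attention weights from the question.
dense = {1: 0.1, 2: 0.5, 3: 0.3, 4: 0.1}

# The sparse mechanism keeps only Tokens 2 and 3.
kept = [2, 3]

# Probability mass surviving the restriction: 0.5 + 0.3 = 0.8.
total = sum(dense[t] for t in kept)

# Redistribute so the kept weights again sum to 1.
sparse = {t: dense[t] / total for t in kept}

print(sparse[2])  # 0.625 — greater than the original 0.5
print(sparse[3])  # 0.375
```

Because Tokens 1 and 4 are dropped, their combined mass (0.2) is redistributed to the survivors in proportion to their original weights, so Token 2's weight can only rise.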
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An attention mechanism calculates normalized weights for a query token against four previous tokens, resulting in weights α₁, α₂, α₃, and α₄. Now, a new version of the mechanism is implemented which is constrained to only consider the second and fourth tokens. The attention scores are re-calculated and re-normalized over just this smaller set, resulting in new weights α'₂ and α'₄. Assuming all original weights were greater than zero, what is the relationship between the new weight for the second token, α'₂, and its original weight, α₂?
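In symbols, and again assuming the restricted softmax amounts to proportional re-normalization of the original weights (which holds when the raw scores are unchanged, since softmax preserves their ratios):

α'₂ = α₂ / (α₂ + α₄)

Since all original weights are greater than zero, α₂ + α₄ = 1 − α₁ − α₃ < 1, so α'₂ > α₂.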
Effect of Sparsity on Attention Weights
Analyzing Attention Weight Redistribution