Learn Before
Analyzing Attention Weight Redistribution
An engineer is analyzing a language model's attention mechanism. For a specific query token, the model initially calculates the following normalized attention weights (dense attention) over four previous tokens:
- Token 1: 0.1
- Token 2: 0.5
- Token 3: 0.3
- Token 4: 0.1
To improve efficiency, the engineer modifies the mechanism to attend only to Token 2 and Token 3, ignoring Tokens 1 and 4. The attention scores for Tokens 2 and 3 are re-calculated and re-normalized over this smaller set.
Predict the effect of this change on the attention weight for Token 2. Will the new, sparse attention weight for Token 2 be greater than, less than, or equal to its original weight of 0.5? Justify your answer based on the principle of probability mass redistribution.
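The re-normalization described in the question can be sketched numerically. This is a hedged illustration, assuming the raw scores are unchanged, so that the restricted softmax reduces to re-normalizing the original weights proportionally (the variable names are illustrative, not part of the question):

```python
# Dense attention weights from the question.
dense = {1: 0.1, 2: 0.5, 3: 0.3, 4: 0.1}

# The sparse mechanism keeps only Tokens 2 and 3.
kept = [2, 3]

# Probability mass surviving the restriction: 0.5 + 0.3 = 0.8.
total = sum(dense[t] for t in kept)

# Redistribute so the kept weights again sum to 1.
sparse = {t: dense[t] / total for t in kept}

print(sparse[2])  # 0.625 — greater than the original 0.5
print(sparse[3])  # 0.375
```

Because Tokens 1 and 4 are dropped, their combined mass (0.2) is redistributed to the survivors in proportion to their original weights, so Token 2's weight can only rise.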
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An attention mechanism calculates normalized weights for a query token against four previous tokens, resulting in weights α₁, α₂, α₃, and α₄. Now, a new version of the mechanism is implemented which is constrained to only consider the second and fourth tokens. The attention scores are re-calculated and re-normalized over just this smaller set, resulting in new weights α'₂ and α'₄. Assuming all original weights were greater than zero, what is the relationship between the new weight for the second token, α'₂, and its original weight, α₂?
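In symbols, and again assuming the restricted softmax amounts to proportional re-normalization of the original weights (which holds when the raw scores are unchanged, since softmax preserves their ratios):

α'₂ = α₂ / (α₂ + α₄)

Since all original weights are greater than zero, α₂ + α₄ = 1 − α₁ − α₃ < 1, so α'₂ > α₂.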
Effect of Sparsity on Attention Weights
Analyzing Attention Weight Redistribution