Short Answer

Effect of Sparsity on Attention Weights

An attention mechanism calculates the following dense attention weights for a specific query token against four preceding key tokens:

  • Token 1: 0.10
  • Token 2: 0.40
  • Token 3: 0.20
  • Token 4: 0.30

The mechanism is then modified to be sparse, so that it attends only to Token 2 and Token 4. The attention scores are re-computed and re-normalized over just this smaller set. Explain the fundamental reason why the new sparse attention weight for Token 2 will be greater than its original dense weight of 0.40. (A worked illustration of the arithmetic follows.)
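For concreteness, here is a minimal sketch of the re-normalization arithmetic, assuming the sparse scores are simply the dense scores restricted to the kept tokens; under that assumption, re-normalizing the published dense weights over the subset is equivalent to re-running softmax on the kept scores, since softmax preserves the ratio between any two surviving weights:

```python
# Dense attention weights from the question.
dense = {"Token 1": 0.10, "Token 2": 0.40, "Token 3": 0.20, "Token 4": 0.30}

# Sparse attention keeps only Token 2 and Token 4. Re-normalize the
# surviving weights so they again sum to 1.
kept = {k: dense[k] for k in ("Token 2", "Token 4")}
total = sum(kept.values())  # 0.70 -- the mass of Token 1 and Token 3 is gone
sparse = {k: v / total for k, v in kept.items()}

print(sparse)  # {'Token 2': 0.5714..., 'Token 4': 0.4285...}
```

The kept weights sum to 0.70 rather than 1.0, so each surviving weight is divided by a denominator smaller than before; Token 2 rises from 0.40 to 0.40 / 0.70 ≈ 0.57 as the probability mass previously assigned to the discarded tokens is redistributed over the surviving ones.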
