
Comparison of Sparse and Dense Attention Weights

In sparse attention, the attention weights are normalized exclusively over the subset of indices defined by the set $G$. This redistributes the probability mass: the weight that would have been assigned to ignored tokens is reallocated among the tokens in $G$. As a result, the sparse attention weight $\alpha'_{i,j}$ for any token $j \in G$ is strictly greater than its counterpart $\alpha_{i,j}$ in a standard dense attention computation:

$$\alpha'_{i,j} > \alpha_{i,j}, \quad \forall j \in G$$
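
The following is a minimal NumPy sketch of this renormalization. The score values, the index set `G`, and the function names are illustrative assumptions, not part of the original text; the point is that restricting the softmax denominator to $G$ inflates every surviving weight.

```python
import numpy as np

def dense_attention_weights(scores):
    # Standard softmax over all key positions.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def sparse_attention_weights(scores, G):
    # Softmax restricted to the index set G; ignored positions get weight 0.
    alpha = np.zeros_like(scores)
    e = np.exp(scores[G] - scores[G].max())
    alpha[G] = e / e.sum()
    return alpha

scores = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical query-key scores for one query
G = [0, 1]                                # attended subset of key positions

dense = dense_attention_weights(scores)
sparse = sparse_attention_weights(scores, G)

# For every j in G, the sparse weight exceeds the dense weight,
# because the softmax denominator sums over fewer (positive) terms.
assert all(sparse[j] > dense[j] for j in G)
```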
