Learn Before
Comparison of Sparse and Dense Attention Weights
In sparse attention, the attention weights are normalized exclusively over the subset of indices defined by the set G. This redistribution of probability mass means that the weight that would have been assigned to the ignored tokens is now allocated among the tokens in G. As a result, the sparse attention weight α'_i for any token i in G is greater than its counterpart α_i in a standard dense attention calculation: α'_i > α_i for all i in G.
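A minimal sketch of this effect, using made-up raw attention scores (the score values and the index set G below are illustrative assumptions, not from the text):

```python
import math

def softmax(scores):
    """Normalize a list of raw scores into attention weights."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw attention scores for four previous tokens (made-up values)
scores = [1.0, 0.5, 2.0, 1.5]

# Dense attention: normalize over all tokens
dense = softmax(scores)

# Sparse attention: normalize only over the index set G
G = [0, 2, 3]
sparse = dict(zip(G, softmax([scores[i] for i in G])))

# Every retained token's sparse weight exceeds its dense counterpart
for i in G:
    assert sparse[i] > dense[i]
```

Because the denominator of the softmax shrinks when scores are dropped while each retained numerator stays the same, every surviving weight can only grow.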

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Comparison of Sparse and Dense Attention Weights
A language model is calculating an output vector using a sparse attention mechanism. The computation for the current token only considers a subset of previous tokens, identified by the index set G = {0, 2, 3}. Given the value vectors and corresponding attention weights below, what is the correct output vector?
Value Vectors:
- v_0 = [2, 1]
- v_1 = [4, 5]
- v_2 = [6, 0]
- v_3 = [1, 3]
Attention Weights for the included set G:
- α'_0 = 0.5
- α'_2 = 0.2
- α'_3 = 0.3
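The computation asked for above is a weighted sum of the value vectors over the included set G; a short sketch using the given numbers:

```python
# Value vectors and sparse attention weights from the question above
values = {0: [2, 1], 1: [4, 5], 2: [6, 0], 3: [1, 3]}
weights = {0: 0.5, 2: 0.2, 3: 0.3}  # only indices in G = {0, 2, 3}

# Output vector: sum over i in G of α'_i * v_i, per dimension
output = [
    sum(weights[i] * values[i][d] for i in weights)
    for d in range(2)
]
# output ≈ [2.5, 1.4] (up to float rounding); v_1 is ignored entirely
```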
Analysis of Sparse Attention Formula Components
Analyzing the Impact of the Sparse Index Set
Learn After
An attention mechanism calculates normalized weights for a query token against four previous tokens, resulting in weights α₁, α₂, α₃, and α₄. Now, a new version of the mechanism is implemented which is constrained to only consider the second and fourth tokens. The attention scores are re-calculated and re-normalized over just this smaller set, resulting in new weights α'₂ and α'₄. Assuming all original weights were greater than zero, what is the relationship between the new weight for the second token, α'₂, and its original weight, α₂?
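A quick numeric sketch of the relationship being asked about, using hypothetical raw scores (the score values are made up for illustration):

```python
import math

def softmax(scores):
    """Normalize raw scores into attention weights."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the four tokens (made-up values)
scores = [0.8, 1.2, 0.4, 2.0]
alpha = softmax(scores)  # original weights α₁..α₄, all > 0

# Constrained mechanism: re-normalize over tokens 2 and 4 only
alpha_p2, alpha_p4 = softmax([scores[1], scores[3]])

# Each surviving weight grows: α'₂ > α₂ and α'₄ > α₄
assert alpha_p2 > alpha[1] and alpha_p4 > alpha[3]

# Equivalently, α'₂ is just α₂ renormalized over the kept tokens:
# α'₂ = α₂ / (α₂ + α₄), which exceeds α₂ whenever α₂ + α₄ < 1
assert abs(alpha_p2 - alpha[1] / (alpha[1] + alpha[3])) < 1e-12
```

The renormalization identity makes the answer independent of the particular scores: as long as some positive weight was dropped, the kept weights must increase.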
Effect of Sparsity on Attention Weights
Analyzing Attention Weight Redistribution