Learn Before
Comparison of Dense and Sparse Attention Matrices
The structure of the attention weight matrix, $\mathbf{A}$, is a primary differentiator between attention mechanisms. This matrix determines how the output is computed as a weighted sum of the Value vectors ($\mathbf{V}$) via the general attention formula $\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathbf{A}\mathbf{V}$, with $\mathbf{A} = \mathrm{Softmax}\!\left(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d}\right)$. In standard dense attention, the matrix $\mathbf{A}$ is fully populated with non-zero values that all contribute to the output. Conversely, sparse attention is based on the premise that most entries of $\mathbf{A}$ can be treated as zero, with only a select subset of non-zero weights being used in the computation.
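As a concrete illustration (not part of the original card), the NumPy sketch below contrasts a dense attention matrix with a sparse one obtained by masking scores before the softmax; the `attention` helper, the sliding-window size of 2, and the toy shapes are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """Compute output = A @ V, where A = Softmax(Q K^T / sqrt(d)).
    If a boolean mask is given, disallowed positions are set to -inf
    before the softmax, so their weights in A become exactly zero."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    A = softmax(scores, axis=-1)
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 8, 16
Q, K, V = rng.normal(size=(3, n, d))

# Dense attention: every entry of A is non-zero.
_, A_dense = attention(Q, K, V)

# Sparse (sliding-window) attention: each query may only attend to keys
# within 2 positions of itself; all other weights in A are exactly zero.
idx = np.arange(n)
window_mask = np.abs(idx[:, None] - idx[None, :]) <= 2
_, A_sparse = attention(Q, K, V, mask=window_mask)

print("non-zero weights per row (dense): ", (A_dense > 0).sum(axis=1))
print("non-zero weights per row (sparse):", (A_sparse > 0).sum(axis=1))
```

Under this toy mask, every row of the dense matrix carries n non-zero weights, while each row of the sparse matrix carries at most 5 (the position itself plus two neighbors on each side), which is the structural difference the card describes.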
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
KV Cache Requirement as a Limitation of Sparse Attention
Global Tokens in Attention
Pruning and Compression as a Consequence of Sparse Attention
Comparison of Dense and Sparse Attention Matrices
A causal transformer model processes a sequence of 1024 tokens. In a standard attention mechanism, each token attends to all previous tokens and itself. Consider a 'sparse' variant where a token at position i (for i > 3) only attends to the following positions: the first token (position 1), its own token (position i), and the two immediately preceding tokens (positions i-1 and i-2). For a token at position 500, how many key-value pairs does it attend to in this sparse model?
Computational Bottlenecks in Long-Sequence Processing
Global Tokens for Attention
Evaluating Architectural Choices for Long-Sequence Models
You’re leading an LLM platform team that must supp...
You’re debugging an LLM inference service that mus...
Your team is deploying a chat-based LLM that must ...
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
You’re reviewing a design doc for a Transformer at...
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets
Sparse Attention Weights Assumption
Classification of Sparse Attention Models by Definition of $\mathbf{A}$
Learn After
Analyzing Computational Bottlenecks in Attention Mechanisms
A team is designing a model to analyze genomic sequences that are millions of characters long. They observe that using a standard attention mechanism, where every character potentially attends to every other character, is computationally infeasible. If they switch to a mechanism that enforces a sparse attention weight matrix, what is the fundamental trade-off they are making?
Interpreting Attention Matrix Structures