Comparison of Dense and Sparse Attention Matrices

The structure of the attention weight matrix $\alpha$ is a primary differentiator between attention mechanisms. This matrix determines how the output is computed as a weighted sum of the Value vectors $\mathbf{V}$ via the general attention formula:

$$Att_{\text{qkv}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \alpha(\mathbf{Q}, \mathbf{K})\,\mathbf{V}$$

In standard dense attention, the $\alpha$ matrix is fully populated: every entry is non-zero and contributes to the output. Sparse attention, by contrast, is based on the premise that most entries of $\alpha$ can be treated as zero, so only a select subset of non-zero weights enters the computation.
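To make the distinction concrete, here is a minimal NumPy sketch of both cases. It assumes the common scaled dot-product instantiation of $\alpha(\mathbf{Q}, \mathbf{K})$ (a softmax over $\mathbf{Q}\mathbf{K}^\top/\sqrt{d}$), which the general formula above does not prescribe; the boolean `mask` encoding the sparsity pattern and the banded window in the usage example are likewise illustrative assumptions, not a specific published pattern.

```python
import numpy as np

def dense_attention(Q, K, V):
    """Dense attention: every entry of the alpha matrix is non-zero
    and contributes to the weighted sum of Value vectors."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # assumed scaled dot-product scoring
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)     # fully populated, row-stochastic alpha
    return alpha @ V                               # Att_qkv(Q, K, V) = alpha(Q, K) V

def sparse_attention(Q, K, V, mask):
    """Sparse attention: entries of alpha outside the boolean `mask`
    are forced to exactly zero; only the selected subset is used.
    Assumes each row of `mask` has at least one True (every query
    attends to at least one key), otherwise its softmax is undefined."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # masked-out entries get zero weight
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha @ V

# Illustrative usage with a hypothetical local (banded) sparsity pattern:
n, d = 6, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
mask = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= 1
print(np.allclose(sparse_attention(Q, K, V, mask),
                  dense_attention(Q, K, V)))       # False: sparsity changes the output
```

Note that this sketch masks a fully materialized score matrix purely for clarity; in practice, the efficiency gain of sparse attention comes from never computing or storing the zeroed entries at all.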
