Learn Before
Sparse Attention Mechanisms
Sparse attention mechanisms are a class of efficiency-oriented methods developed to address the quadratic time and memory complexity, with respect to sequence length, of standard self-attention in Transformers. Instead of allowing every token to attend to every other token, these mechanisms restrict attention to a smaller, sparser set of connections, reducing the computational cost and making inference on long sequences more efficient.
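To make the idea concrete, below is a minimal illustrative sketch (not part of the course material) of scaled dot-product attention restricted by a boolean mask, with a local sliding-window pattern as one possible sparse connection set. The function names sparse_attention and local_window_mask, and all parameter choices, are hypothetical.

```python
import numpy as np

def sparse_attention(q, k, v, mask):
    # Scaled dot-product attention restricted to the pairs allowed by `mask`.
    # q, k, v: (seq_len, d) arrays; mask[i, j] is True if token i may attend to token j.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # full score matrix, for clarity only
    scores = np.where(mask, scores, -np.inf)   # disallowed pairs get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def local_window_mask(seq_len, window=16):
    # Each token may attend only to tokens within `window` positions of itself.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Toy usage: 64 tokens with 8-dimensional representations and a window of 4.
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((64, 8))
out = sparse_attention(q, k, v, local_window_mask(64, window=4))
print(out.shape)  # (64, 8)
```

Note that this dense-mask formulation only illustrates the connection pattern; practical implementations exploit the sparsity so the masked-out scores are never computed at all, which is where the savings over quadratic dense attention actually come from.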
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Sparse Attention Mechanisms
Linear-Time Models for Transformers
A development team is building a text summarization system for lengthy legal documents, often exceeding 10,000 tokens. They observe that their current model, which uses a standard attention mechanism, is prohibitively slow and memory-intensive for these inputs. Which of the following statements best analyzes the underlying computational problem and the reason why adopting an 'efficient attention' variant would be a suitable solution?
Optimizing a Chatbot for Long Conversations
Evaluating Attention Mechanisms for Long-Sequence Processing
Categorization of KV Cache Optimizations
Learn After
An AI development team is building a model to summarize entire books, which involves processing extremely long sequences of text. To manage the computational resources, they decide to replace the standard self-attention mechanism, where every token attends to every other token, with a sparse one that restricts connections. Which of the following statements provides the most accurate evaluation of this decision's primary trade-off?
A large language model processes a long document. Consider two different sparse attention patterns that could be used instead of the standard all-to-all attention:
- Pattern A: Each token can only attend to the 16 tokens immediately preceding and following it.
- Pattern B: In addition to its local neighbors, each token can also attend to a few pre-selected 'global' tokens that are distributed across the entire sequence (e.g., the first token, the last token).
Which statement best analyzes the primary difference in how these two patterns capture information?
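As an aid for reasoning about the two patterns above, the following hypothetical sketch builds both masks and checks whether a distant token can reach the first token. The names pattern_a_mask and pattern_b_mask, the 16-token window, and the choice of global positions are assumptions for demonstration only.

```python
import numpy as np

def pattern_a_mask(seq_len, window=16):
    # Pattern A: each token attends only to neighbours within the local window.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def pattern_b_mask(seq_len, window=16, global_positions=(0, -1)):
    # Pattern B: the local window plus a few pre-selected global tokens.
    mask = pattern_a_mask(seq_len, window)
    g = [p % seq_len for p in global_positions]   # e.g. first and last token
    mask[:, g] = True   # every token may attend to the global tokens
    mask[g, :] = True   # and the global tokens may attend to every token
    return mask

seq_len = 1024
a = pattern_a_mask(seq_len)
b = pattern_b_mask(seq_len)
# Under Pattern A, token 1000 cannot attend to token 0 (it is outside the window);
# under Pattern B it can, because token 0 acts as a global bridge across the sequence.
print(a[1000, 0], b[1000, 0])  # False True
```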
Optimizing a Legal Document Analysis Model