Global Tokens in Attention
Global tokens are a widely used technique in attention mechanisms for combining local and long-range context. A small number of tokens at the beginning of the sequence are designated as 'global,' making them accessible to all other tokens during attention computation. Often implemented alongside sparse attention patterns, these tokens serve as a form of global memory. They help stabilize the model by smoothing the Softmax distribution of attention weights, but they introduce a trade-off: because this global memory is fixed in size, it can lose information, creating a tension between representational capacity and computational cost.
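To make this concrete, below is a minimal NumPy sketch (not from the source) of an attention pattern that combines a local sliding window with a few global tokens. The function names sparse_mask_with_global_tokens and masked_attention, the window size, and the choice of two global tokens are illustrative assumptions; production implementations (e.g., Longformer-style global attention) use fused sparse kernels rather than materializing a dense mask.

```python
import numpy as np

def sparse_mask_with_global_tokens(seq_len: int, window: int, num_global: int) -> np.ndarray:
    """Boolean mask: entry (i, j) is True if query i may attend to key j.

    Combines a local sliding window with 'global' tokens at the start of
    the sequence: every position can attend to them, and they can attend
    to every position.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True      # local window around each token
    mask[:, :num_global] = True    # all tokens attend to the global tokens
    mask[:num_global, :] = True    # global tokens attend to all tokens
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; disallowed pairs get -inf before the
    Softmax, so their attention weights are exactly zero."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: 64 tokens of dimension 16, window of 4, 2 global tokens.
rng = np.random.default_rng(0)
T, D = 64, 16
q, k, v = (rng.normal(size=(T, D)) for _ in range(3))
mask = sparse_mask_with_global_tokens(T, window=4, num_global=2)
print(masked_attention(q, k, v, mask).shape)  # (64, 16)
```

Note how the trade-off from the paragraph above shows up here: each row of the mask has on the order of window + num_global allowed entries rather than seq_len, which is what makes the pattern cheap, but any information outside a token's local window must be routed through the fixed set of global tokens, so their number bounds how much long-range context can be preserved.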