Learn Before
Global Tokens for Attention
A widely used technique for combining local and long-range context is to designate the first few tokens of a sequence as 'global tokens'. Every other token can attend to these tokens during the attention computation, so they effectively serve as a form of global memory. This method is frequently used in conjunction with sparse attention models, where it restores the long-range information flow that purely local attention windows would otherwise cut off.
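To make the pattern concrete, here is a minimal sketch in NumPy of a causal attention mask that combines a small local window with a few global tokens at the start of the sequence. The function name and the `num_global` / `window` parameters are illustrative choices for this sketch, not taken from any particular library.

```python
import numpy as np

def global_plus_local_mask(seq_len: int, num_global: int = 4, window: int = 2) -> np.ndarray:
    """Boolean causal attention mask (True = key position may be attended to).

    Combines a local window with `num_global` global tokens at the start
    of the sequence; both sizes are illustrative defaults.
    """
    rows = np.arange(seq_len)[:, None]  # query positions
    cols = np.arange(seq_len)[None, :]  # key positions
    causal = cols <= rows               # never attend to future tokens
    local = (rows - cols) <= window     # self plus a few recent neighbours
    global_keys = cols < num_global     # every token can see the global tokens
    return causal & (local | global_keys)

print(global_plus_local_mask(8).astype(int))
```

In a full model this mask would be applied to the attention logits before the softmax (e.g., by setting masked positions to negative infinity). In encoder-style models such as Longformer, global attention is symmetric, so the global tokens also attend to every position; in the causal sketch above, tokens at the start of the sequence can only look backward anyway.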
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
KV Cache Requirement as a Limitation of Sparse Attention
Global Tokens in Attention
Pruning and Compression as a Consequence of Sparse Attention
Comparison of Dense and Sparse Attention Matrices
A causal transformer model processes a sequence of 1024 tokens. In a standard attention mechanism, each token attends to all previous tokens and itself. Consider a 'sparse' variant where a token at position i (for i > 3) only attends to the following positions: the first token (position 1), its own token (position i), and the two immediately preceding tokens (positions i-1 and i-2). For a token at position 500, how many key-value pairs does it attend to in this sparse model? (A quick worked count appears after this list.)
Computational Bottlenecks in Long-Sequence Processing
Evaluating Architectural Choices for Long-Sequence Models
You’re leading an LLM platform team that must supp...
You’re debugging an LLM inference service that mus...
Your team is deploying a chat-based LLM that must ...
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
You’re reviewing a design doc for a Transformer at...
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets
Sparse Attention Weights Assumption
Classification of Sparse Attention Models by Definition of
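As a quick check on the sparse-attention counting question in the list above: a token at position 500 attends to positions {1, 498, 499, 500}, i.e. 4 key-value pairs. A tiny sketch (the helper name is hypothetical):

```python
def attended_positions(i: int) -> set[int]:
    """Positions attended under the question's sparse pattern (1-indexed, i > 3)."""
    # The first token, the token itself, and its two immediate predecessors.
    return {1, i - 2, i - 1, i}

print(len(attended_positions(500)))  # -> 4 key-value pairs
```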
Learn After
Performance Stabilization via Global Tokens
Trade-off of Fixed-Size Global Memory
An engineer is optimizing a model for processing extremely long text sequences. To reduce the computational load, the model is designed so that each token primarily attends to a limited, local neighborhood of other tokens. The engineer observes that the model struggles to connect information from the end of a document back to key concepts introduced in the very first paragraph. Which of the following modifications best addresses this issue by providing a form of global context without sacrificing the overall computational efficiency?
Analyzing Attention Mechanisms for Long Sequences
Evaluating a Hybrid Attention Strategy