
Advantage of Global Tokens in Stabilizing Attention

A key benefit of global tokens is that they stabilize model behavior, particularly when processing very long contexts. Because every position can attend to a global token, it acts as a sink that can absorb excess probability mass, smoothing the output distribution of the softmax function used to compute attention weights. Without such a token, the softmax must distribute all of its probability mass over content tokens even when every attention score is low and noisy, which can produce erratic weights; a smoother weight distribution leads to more stable training and inference.
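The smoothing effect can be illustrated numerically. The sketch below (an assumption for illustration, not taken from the text) compares softmax attention weights over a long sequence of weak, noisy scores with and without one extra global-token score: the global token absorbs part of the probability mass, so the weights on the content tokens become uniformly smaller and less spread out.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
# Weak, noisy attention scores over a long context (hypothetical values).
scores = rng.normal(0.0, 0.1, size=4096)

# Attention weights without a global token: all mass goes to content tokens.
w_plain = softmax(scores)

# Prepend a global-token score (assumed value 3.0); it absorbs probability
# mass, scaling every content-token weight down by the same factor.
w_global = softmax(np.concatenate(([3.0], scores)))

print("max content weight, no global token:  ", w_plain.max())
print("max content weight, with global token:", w_global[1:].max())
print("std  content weights, no global token:  ", w_plain.std())
print("std  content weights, with global token:", w_global[1:].std())
```

Since the global token only adds one term to the softmax denominator, the content-token weights keep their relative ordering but shrink uniformly, which damps spikes in the attention pattern rather than changing what the model attends to.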

Updated 2026-04-23

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences