Learn Before
Advantage of Global Tokens in Stabilizing Attention
A key benefit of using global tokens is their ability to stabilize model performance, particularly when processing very long contexts. Because every query can always attend to a global token, these tokens absorb part of the attention mass, smoothing the output distribution of the Softmax function used to compute the attention weights. A smoother weight distribution helps prevent erratic, "spiky" attention behavior and leads to more stable training and inference.
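To make the smoothing effect concrete, here is a minimal NumPy sketch. All logit values are hypothetical and chosen purely for illustration: prepending a single always-attendable global-token logit to a "peaky" set of local attention logits lowers the maximum weight any single token receives and raises the entropy of the Softmax output.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Hypothetical attention logits for one query over a 16-token local window.
# One outlier logit dominates, producing a "peaky" weight distribution.
local_logits = rng.normal(0.0, 1.0, size=16)
local_logits[7] += 5.0  # inject an outlier

peaky = softmax(local_logits)

# Add a single global token whose logit stays competitive (an assumed,
# illustrative value standing in for a learned score). It soaks up
# probability mass and flattens the distribution over local tokens.
global_logit = np.array([4.0])
with_global = softmax(np.concatenate([global_logit, local_logits]))

print(f"max local weight without global token: {peaky.max():.3f}")
print(f"max local weight with global token:    {with_global[1:].max():.3f}")
print(f"entropy without global token: {-(peaky * np.log(peaky)).sum():.3f}")
print(f"entropy with global token:    {-(with_global * np.log(with_global)).sum():.3f}")
```

Running the sketch shows the maximum weight dropping and the entropy rising once the global token is present, which is the distributional smoothing the paragraph above describes.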
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Information Loss in Fixed-Size Global Memory
A language model is designed with an efficient attention mechanism where each token can only interact with the 16 tokens immediately preceding and following it. This model performs poorly on tasks that require summarizing a long document, as it fails to connect information from the introduction to the conclusion. Which of the following architectural changes is most specifically designed to solve this type of long-range dependency issue while largely preserving the model's computational efficiency?
Evaluating an Attention Mechanism for Legal Document Processing
In an attention mechanism that uses a fixed number of designated tokens as a form of global memory, continuously increasing the number of these special tokens is a guaranteed strategy to improve model performance on all tasks without introducing any negative consequences.
Learn After
An engineer is training a model on very long documents and observes that the attention mechanism is behaving erratically. The model's focus shifts dramatically between tokens from one training step to the next, leading to poor convergence. A closer look at the attention weight distributions reveals they are often extremely "peaky," with one or two tokens receiving nearly all the weight (e.g., weights like [0.01, 0.98, 0.01]), and the location of this peak changes unpredictably. Which of the following interventions is most likely to mitigate this issue by directly addressing the unstable nature of the attention weight distribution?
Mechanism of Attention Stabilization
The Role of Global Tokens in Mitigating 'Spiky' Attention