1Cademy - An engineer is training a model on very long documents and observes that the attention mechanism is behaving erratically. The models focus shifts dramatically between tokens from one training step to the next, leading to poor convergence. A closer look at the attention weight distributions reveals they are often extremely peaky, with one or two tokens receiving nearly all the weight (e.g., weights like [0.01, 0.98, 0.01]), and the location of this peak changes unpredictably. Which of the following interventions is most likely to mitigate this issue by directly addressing the unstable nature of the attention weight distribution?

Learn Before

Advantage of Global Tokens in Stabilizing Attention

Multiple Choice

An engineer is training a model on very long documents and observes that the attention mechanism is behaving erratically. The model's focus shifts dramatically between tokens from one training step to the next, leading to poor convergence. A closer look at the attention weight distributions reveals they are often extremely "peaky," with one or two tokens receiving nearly all the weight (e.g., weights like [0.01, 0.98, 0.01]), and the location of this peak changes unpredictably. Which of the following interventions is most likely to mitigate this issue by directly addressing the unstable nature of the attention weight distribution?

Updated 2025-10-01

Contributors are:

Who are from:

Learn Before

Related