Multiple Choice

An engineer is training a model on very long documents and observes that the attention mechanism is behaving erratically. The model's focus shifts dramatically between tokens from one training step to the next, leading to poor convergence. A closer look at the attention weight distributions reveals they are often extremely "peaky," with one or two tokens receiving nearly all the weight (e.g., weights like [0.01, 0.98, 0.01]), and the location of this peak changes unpredictably. Which of the following interventions is most likely to mitigate this issue by directly addressing the unstable nature of the attention weight distribution?
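To make the symptom concrete, here is a minimal sketch of how softmax behaves when the pre-softmax scores have large magnitude. The score values, the key dimension `d_k = 64`, and the helper names are illustrative assumptions, not part of the question. Dividing by `sqrt(d_k)` is used here only as one familiar way to shrink score magnitudes, and the sketch is not meant as the answer key.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(weights):
    """Shannon entropy in nats; lower means a peakier distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def max_shift(before, after):
    """Largest change in any single attention weight between two steps."""
    return max(abs(a - b) for a, b in zip(before, after))

# Hypothetical raw query-key dot products for three tokens.
raw_scores = [4.0, 12.0, 4.5]
# The same scores after a small change, standing in for one training update.
perturbed_scores = [4.0, 12.0, 13.0]
d_k = 64  # assumed key dimension

def scale(scores):
    # Shrink score magnitudes before the softmax.
    return [s / math.sqrt(d_k) for s in scores]

peaky = softmax(raw_scores)
peaky_after = softmax(perturbed_scores)
scaled = softmax(scale(raw_scores))
scaled_after = softmax(scale(perturbed_scores))

print("unscaled:", [round(w, 3) for w in peaky], "->", [round(w, 3) for w in peaky_after])
print("scaled:  ", [round(w, 3) for w in scaled], "->", [round(w, 3) for w in scaled_after])
print("entropy unscaled vs scaled:", round(entropy(peaky), 3), round(entropy(scaled), 3))
print("max weight shift unscaled vs scaled:",
      round(max_shift(peaky, peaky_after), 3),
      round(max_shift(scaled, scaled_after), 3))
```

With large-magnitude scores, nearly all the weight lands on one token, and a modest change to a single score moves most of that weight elsewhere. With smaller-magnitude scores, the distribution is flatter and the same change shifts the weights far less.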

Updated 2025-10-01

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science