Multiple Choice

An engineer is implementing an attention mechanism in which the output is a weighted sum of Value vectors, with weights produced by a Softmax over scores computed as the dot product of the Query (Q) and Key (K) matrices. They observe that as the dimension d of the Query and Key vectors increases, the attention weights become extremely concentrated on a single position (e.g., [0.01, 0.98, 0.01]), causing training instability. What is the most likely cause of this issue?
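For reference, the behaviour described in the question can be reproduced with a short numerical sketch (illustrative only; the helper names and the use of NumPy are assumptions, not part of the original question). Without scaling, the variance of a Query-Key dot product grows roughly linearly with d, so the Softmax saturates toward a one-hot distribution; dividing the scores by sqrt(d) keeps the weights much smoother.

```python
# Minimal sketch (assumption: unit-variance random Q/K components) showing why
# raw dot-product scores saturate the Softmax as the Query/Key dimension d grows.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
for d in (8, 64, 512):
    q = rng.standard_normal(d)        # one query vector
    K = rng.standard_normal((3, d))   # three key vectors
    scores = K @ q                    # raw dot products: variance grows ~ d
    print(d,
          np.round(softmax(scores), 3),                # unscaled: tends toward one-hot
          np.round(softmax(scores / np.sqrt(d)), 3))   # scaled by sqrt(d): stays smoother
```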

Updated 2025-10-02


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.5 Inference - Foundations of Large Language Models

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science