In an attention mechanism, the scores for a query vector q are calculated by taking its dot product with a set of key vectors K. These scores are then scaled by a factor related to the vector dimension before being passed to a Softmax function to produce weights. A developer implements this but omits the scaling step, using the formula Softmax(q * K^T) * V. What is the most likely adverse effect of this omission, especially when the dimension of the key vectors is large?
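As a minimal NumPy sketch (not part of the original card; the dimension d_k = 512, the eight keys, and the random inputs are arbitrary assumptions for illustration), the snippet below compares Softmax weights with and without the 1/sqrt(d_k) scaling. With large d_k, the raw dot products have variance on the order of d_k, so the unscaled Softmax tends to saturate toward a near one-hot distribution:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512                          # large key/query dimension (assumed for the demo)
q = rng.standard_normal(d_k)       # one query vector
K = rng.standard_normal((8, d_k))  # eight key vectors

scores = q @ K.T                   # raw dot products; variance grows with d_k
weights_unscaled = softmax(scores)
weights_scaled = softmax(scores / np.sqrt(d_k))

print("max raw score magnitude:", np.abs(scores).max())
print("unscaled weights:", np.round(weights_unscaled, 4))
print("scaled weights:  ", np.round(weights_scaled, 4))
# Without the 1/sqrt(d_k) factor, one logit dominates and the Softmax
# output collapses toward one-hot, which flattens the Softmax gradient
# and slows or stalls learning.
```

Running this typically shows the unscaled weights concentrating nearly all mass on a single key, while the scaled weights remain a softer distribution, which is the behavior the question is probing.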
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Causal Attention
Calculating Pre-Softmax Attention Scores
Applying Scaled Dot-Product Attention