In an attention mechanism, the scores for a query vector q are calculated by taking its dot product with a set of key vectors K. These scores are then scaled by a factor related to the vector dimension before being passed to a Softmax function to produce weights. A developer implements this but omits the scaling step, using the formula Softmax(q * K^T) * V. What is the most likely adverse effect of this omission, especially when the dimension of the key vectors is large?
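As a minimal NumPy sketch (not part of the original card; the dimension d_k = 512, the eight keys, and the random inputs are arbitrary assumptions for illustration), the snippet below compares Softmax weights with and without the 1/sqrt(d_k) scaling. With large d_k, the raw dot products have variance on the order of d_k, so the unscaled Softmax tends to saturate toward a near one-hot distribution:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512                          # large key/query dimension (assumed for the demo)
q = rng.standard_normal(d_k)       # one query vector
K = rng.standard_normal((8, d_k))  # eight key vectors

scores = q @ K.T                   # raw dot products; variance grows with d_k
weights_unscaled = softmax(scores)
weights_scaled = softmax(scores / np.sqrt(d_k))

print("max raw score magnitude:", np.abs(scores).max())
print("unscaled weights:", np.round(weights_unscaled, 4))
print("scaled weights:  ", np.round(weights_scaled, 4))
# Without the 1/sqrt(d_k) factor, one logit dominates and the Softmax
# output collapses toward one-hot, which flattens the Softmax gradient
# and slows or stalls learning.
```

Running this typically shows the unscaled weights concentrating nearly all mass on a single key, while the scaled weights remain a softer distribution, which is the behavior the question is probing.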
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Causal Attention
Calculating Pre-Softmax Attention Scores
Applying Scaled Dot-Product Attention