1Cademy - In a mechanism that calculates attention, scores are computed by taking the dot product of a query vector with a set of key vectors. These scores are then scaled by dividing by the square root of the dimension of the vectors (e.g., $\sqrt{d}$) before being passed to a Softmax function. What is the most likely adverse consequence of removing this scaling step, particularly when the vector dimension is large?

Learn Before

Formula for Single-Head Self-Attention

Multiple Choice

In a mechanism that calculates attention, scores are computed by taking the dot product of a query vector with a set of key vectors. These scores are then scaled by dividing by the square root of the dimension of the vectors (e.g., $\sqrt{d}$ ) before being passed to a Softmax function. What is the most likely adverse consequence of removing this scaling step, particularly when the vector dimension is large?

Updated 2025-10-02

Contributors are:

Who are from:

Learn Before

Related