Multiple Choice

In a mechanism that calculates attention, scores are computed by taking the dot product of a query vector with a set of key vectors. These scores are then scaled by dividing by the square root of the dimension of the vectors (e.g., d\sqrt{d}) before being passed to a Softmax function. What is the most likely adverse consequence of removing this scaling step, particularly when the vector dimension is large?

0

1

Updated 2025-10-02

Contributors are:

Who are from:

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science