Multiple Choice

In an attention mechanism, the scores for a query vector q are calculated by taking its dot product with a set of key vectors K. These scores are then scaled by 1/sqrt(d), where d is the key dimension, before being passed to a Softmax function to produce weights. A developer implements this but omits the scaling step, computing Softmax(qK^T)V instead of Softmax(qK^T / sqrt(d))V. What is the most likely adverse effect of this omission, especially when the dimension of the key vectors is large?
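To see why the omission matters, the effect can be sketched numerically: for random q and K with unit-variance entries, the dot products q·k have variance on the order of d, so for large d the unscaled scores are large in magnitude and the Softmax saturates, putting nearly all weight on one key (and yielding near-zero gradients for the rest). A minimal illustration with NumPy (the dimension d = 512 and the number of keys are arbitrary choices for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512          # key/query dimension (illustrative)
num_keys = 8

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

q = rng.standard_normal(d)
K = rng.standard_normal((num_keys, d))

scores = K @ q                         # raw dot products; variance grows with d
unscaled = softmax(scores)             # developer's buggy version
scaled = softmax(scores / np.sqrt(d))  # standard scaled dot-product attention

# The unscaled weights collapse toward a one-hot distribution,
# while the scaled weights remain comparatively spread out.
print("max unscaled weight:", unscaled.max())
print("max scaled weight:  ", scaled.max())
```

The peaked unscaled distribution is the crux of the question: a saturated Softmax behaves almost like an argmax, so gradients flowing back through the non-selected keys vanish and training becomes unstable or slow.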

Updated 2025-10-02

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science