Multiple Choice

An engineer is designing a self-attention layer for a text processing model. They notice that as they increase the dimensionality (d_k) of the query and key vectors, the training process becomes unstable, and the gradients used for learning become extremely small. Which of the following best explains this phenomenon and the standard solution implemented within the attention mechanism?
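For context on the underlying mechanism the question is probing: the dot product of a query and key with unit-variance components is a sum of d_k terms, so its variance grows linearly with d_k; large scores push softmax toward a one-hot output, where its gradients vanish. The standard fix, scaled dot-product attention, divides the scores by sqrt(d_k). A minimal NumPy sketch of the effect (random vectors, not a full attention layer):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for d_k in (4, 64, 1024):
    # Queries/keys with unit-variance components: each dot product
    # q . k sums d_k independent terms, so Var(q . k) ~ d_k.
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    scores = (q * k).sum(axis=-1)
    print(f"d_k={d_k:4d}  var(q.k)={scores.var():8.1f}  "
          f"var(q.k/sqrt(d_k))={(scores / np.sqrt(d_k)).var():.2f}")

# Large unscaled scores saturate softmax: one weight near 1, the rest
# near 0, so gradients flowing through the softmax become tiny.
raw = rng.standard_normal(8) * np.sqrt(1024)  # unscaled scores, d_k = 1024
scaled = raw / np.sqrt(1024)                  # scaled scores
print("max attention weight, unscaled:", softmax(raw).max())
print("max attention weight, scaled:  ", softmax(scaled).max())
```

The printout shows the unscaled score variance tracking d_k while the scaled variance stays near 1, which is exactly the instability/vanishing-gradient behavior the question describes and the division by sqrt(d_k) that prevents it.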

Updated 2025-09-28

Tags

Data Science

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science