1Cademy - Variance Control in Dot Product Attention

Learn Before

Scaled Dot-Product Attention

Concept

Variance Control in Dot Product Attention

When computing dot product attention, the magnitude of the scores must be controlled before they are passed through the exponential function inside the softmax; otherwise the softmax saturates and its gradients vanish. Assume that all elements of a query vector $\mathbf{q} \in \mathbb{R}^d$ and a key vector $\mathbf{k}_i \in \mathbb{R}^d$ are independent and identically distributed random variables with mean $0$ and variance $1$. Each elementwise product $q_j k_{ij}$ then has mean $0$ and variance $1$, and because the $d$ products are independent, their variances add: the dot product $\mathbf{q}^\top \mathbf{k}_i$ has mean $0$ and variance $d$. Because this variance grows linearly with the vector dimensionality $d$, the raw dot products typically have magnitude on the order of $\sqrt{d}$, which for large $d$ pushes the softmax into its saturated regions. To keep the variance of the score at $1$ regardless of the vector length, the dot product is divided by $\sqrt{d}$, since $\mathrm{Var}(\mathbf{q}^\top \mathbf{k}_i / \sqrt{d}) = d / d = 1$. This stabilization step yields the scaled dot-product attention scoring function: $a(\mathbf{q}, \mathbf{k}_i) = \mathbf{q}^\top \mathbf{k}_i / \sqrt{d}$.
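The variance argument can be checked numerically. The following is a minimal sketch, assuming standard normal entries for the queries and keys (one convenient distribution with mean $0$ and variance $1$); the helper names `dot_product_variances` and `softmax`, the sample size, and the seed are illustrative choices, not part of the source.

```python
import numpy as np


def dot_product_variances(d, n_samples=10000, seed=0):
    """Estimate the variance of q.k and of q.k / sqrt(d).

    Assumption: entries of q and k are i.i.d. standard normal
    (mean 0, variance 1), matching the setup in the text.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d))
    k = rng.standard_normal((n_samples, d))
    raw = np.sum(q * k, axis=1)      # one dot product per sampled (q, k) pair
    scaled = raw / np.sqrt(d)        # scaled dot-product attention score
    return raw.var(), scaled.var()


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()


if __name__ == "__main__":
    # Raw variance should track d; scaled variance should stay near 1.
    for d in (4, 64, 256):
        raw_var, scaled_var = dot_product_variances(d)
        print(f"d={d:4d}  raw variance={raw_var:8.2f}  scaled variance={scaled_var:.3f}")

    # Saturation check: one query scored against 10 keys at d = 512.
    rng = np.random.default_rng(1)
    d = 512
    q = rng.standard_normal(d)
    keys = rng.standard_normal((10, d))
    scores = keys @ q
    print("largest weight, unscaled:", softmax(scores).max())
    print("largest weight, scaled:  ", softmax(scores / np.sqrt(d)).max())
```

With the raw scores, the largest attention weight tends to sit close to $1$, so nearly all other weights (and their gradients) are close to $0$; dividing by $\sqrt{d}$ spreads the weights more evenly across the keys.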

Updated 2026-05-14

References

Dive into Deep Learning
