Concept

Variance Control in Dot Product Attention

When computing dot-product attention, the magnitude of the scores must be controlled before they are passed to the softmax function, whose exponentials saturate for large inputs and then yield vanishing gradients. Assume that all elements of a query vector $\mathbf{q} \in \mathbb{R}^d$ and a key vector $\mathbf{k}_i \in \mathbb{R}^d$ are independent and identically distributed random variables with mean $0$ and variance $1$. Their dot product $\mathbf{q}^\top \mathbf{k}_i$ then has mean $0$ and variance $d$, because it is a sum of $d$ independent products, each with mean $0$ and variance $1$. Since this variance grows linearly with the dimensionality $d$, the raw dot products can become very large and push the softmax into its saturated regions. To keep the variance of the score at $1$ regardless of the vector length, the dot product is divided by $\sqrt{d}$. This stabilization step gives the scaled dot-product attention scoring function: $a(\mathbf{q}, \mathbf{k}_i) = \mathbf{q}^\top \mathbf{k}_i / \sqrt{d}$.
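The variance argument above can be checked empirically. The following is a minimal sketch using NumPy, not code from the source: the function names `dot_product_variance` and `softmax`, the sample counts, and the choice of standard normal entries (one distribution with mean 0 and variance 1) are all assumptions made for illustration.

```python
import numpy as np


def dot_product_variance(d, n_samples=10_000, seed=0):
    """Estimate the variance of q.k and of q.k / sqrt(d).

    Illustrative helper (hypothetical name): entries of q and k are drawn
    i.i.d. from a standard normal, i.e. mean 0 and variance 1.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d))
    k = rng.standard_normal((n_samples, d))
    raw = np.sum(q * k, axis=1)      # one dot product per sampled (q, k) pair
    scaled = raw / np.sqrt(d)        # scaled dot-product score
    return raw.var(), scaled.var()


def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(x - x.max())
    return e / e.sum()


if __name__ == "__main__":
    # Raw variance should track d; scaled variance should stay near 1.
    for d in (16, 64, 256):
        raw_var, scaled_var = dot_product_variance(d)
        print(f"d={d:4d}  raw variance ~ {raw_var:8.2f}  "
              f"scaled variance ~ {scaled_var:.3f}")

    # Saturation: one query scored against 10 keys with d = 512.
    rng = np.random.default_rng(1)
    d = 512
    q = rng.standard_normal(d)
    keys = rng.standard_normal((10, d))
    raw_scores = keys @ q
    print("largest weight, raw scores:   ", softmax(raw_scores).max())
    print("largest weight, scaled scores:", softmax(raw_scores / np.sqrt(d)).max())
```

In this sketch the estimated raw variance should grow roughly in proportion to $d$, while the scaled variance should stay close to $1$. The second part illustrates saturation: without scaling, the softmax tends to put almost all of its weight on a single key, whereas dividing by $\sqrt{d}$ spreads the weights more evenly.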

Updated 2026-05-14

Tags

D2L

Dive into Deep Learning @ D2L
