1Cademy - Derivation of Dot Product Attention from Gaussian Kernel

Learn Before

Gaussian Attention Kernel

Concept

Derivation of Dot Product Attention from Gaussian Kernel

The attention scoring function derived from the Gaussian kernel (before exponentiation) is $a(\mathbf{q}, \mathbf{k}_i) = -\frac{1}{2} \|\mathbf{q} - \mathbf{k}_i\|^2 = \mathbf{q}^\top \mathbf{k}_i -\frac{1}{2} \|\mathbf{k}_i\|^2 -\frac{1}{2} \|\mathbf{q}\|^2$ . The final term ( $-\frac{1}{2} \|\mathbf{q}\|^2$ ) depends only on the query, so it is identical for every key $\mathbf{k}_i$ and cancels exactly when the attention weights are normalized to sum to $1$ (e.g., via the softmax operation). Furthermore, when the keys are produced by layers that use batch or layer normalization, their norms ( $\|\mathbf{k}_i\|$ ) are well-bounded and often approximately constant, so the key-dependent term $-\frac{1}{2} \|\mathbf{k}_i\|^2$ can be dropped without a major change in the outcome. With both norm terms removed, the Gaussian kernel score reduces to the standard dot product attention scoring function: $a(\mathbf{q}, \mathbf{k}_i) = \mathbf{q}^\top \mathbf{k}_i$ .
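The two simplifications can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation: the helper names (`softmax`, `dot`, `gaussian_score`, `unit`), the dimensions, and the random seed are all illustrative assumptions. It also idealizes the normalization argument by rescaling every key to exactly unit norm; with norms that are only approximately constant, the second comparison would hold approximately rather than to rounding error.

```python
import math
import random


def softmax(scores):
    """Normalize scores into weights that sum to 1 (max-shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product q^T k of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_score(q, k):
    """Gaussian kernel score before exponentiation: -1/2 * ||q - k||^2."""
    return -0.5 * sum((a - b) ** 2 for a, b in zip(q, k))


def unit(v):
    """Rescale v to unit norm (an idealized stand-in for 'constant key norms')."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
d, n = 4, 5             # illustrative vector dimension and number of keys
q = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Step 1: dropping -1/2 * ||q||^2 leaves the softmax weights unchanged.
w_gauss = softmax([gaussian_score(q, k) for k in keys])
w_no_q = softmax([dot(q, k) - 0.5 * dot(k, k) for k in keys])

# Step 2: with constant key norms, -1/2 * ||k_i||^2 is also a shared constant,
# so plain dot-product scores give the same weights as the Gaussian scores.
unit_keys = [unit(k) for k in keys]
w_gauss_unit = softmax([gaussian_score(q, k) for k in unit_keys])
w_dot_unit = softmax([dot(q, k) for k in unit_keys])

max_diff_q = max(abs(a - b) for a, b in zip(w_gauss, w_no_q))
max_diff_unit = max(abs(a - b) for a, b in zip(w_gauss_unit, w_dot_unit))
print("max weight difference after dropping the query term:", max_diff_q)
print("max weight difference with unit-norm keys:", max_diff_unit)
```

Both printed differences should be at the level of floating-point rounding, because softmax is invariant to adding the same constant to every score.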

Updated 2026-05-14

References

Dive into Deep Learning

Learn After

Dot Product Attention
