Concept

Derivation of Dot Product Attention from Gaussian Kernel

The attention scoring function derived from the Gaussian kernel (before exponentiation) is $a(\mathbf{q}, \mathbf{k}_i) = -\frac{1}{2} \|\mathbf{q} - \mathbf{k}_i\|^2 = \mathbf{q}^\top \mathbf{k}_i - \frac{1}{2} \|\mathbf{k}_i\|^2 - \frac{1}{2} \|\mathbf{q}\|^2$. Because the final term, $-\frac{1}{2} \|\mathbf{q}\|^2$, depends only on the query and is identical for every key paired with that query, it cancels completely when the attention weights are normalized to sum to $1$ (e.g., via the softmax operation). Furthermore, when key vectors are produced with techniques such as batch or layer normalization, their norms $\|\mathbf{k}_i\|$ are well bounded and essentially constant, so the key-dependent term $-\frac{1}{2} \|\mathbf{k}_i\|^2$ can be dropped without a major change in the outcome. With both norm terms removed, the Gaussian kernel score simplifies to the standard dot product attention scoring function: $a(\mathbf{q}, \mathbf{k}_i) = \mathbf{q}^\top \mathbf{k}_i$.
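The two simplifications can be checked numerically. The following is a minimal NumPy sketch, not code from the source: the `softmax` helper, the random query and keys, and the variable names are all illustrative assumptions. Rescaling the keys to exactly unit norm is an idealized stand-in for the "essentially constant" norms described above; with real batch- or layer-normalized keys the last equality would hold only approximately.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=4)          # one query (illustrative data)
keys = rng.normal(size=(5, 4))  # five keys (illustrative data)

# Gaussian-kernel score: -1/2 * ||q - k_i||^2
gauss_scores = -0.5 * np.sum((q - keys) ** 2, axis=1)

# Same score with the query-only term -1/2 * ||q||^2 dropped.
scores_no_q = keys @ q - 0.5 * np.sum(keys ** 2, axis=1)

# Softmax is invariant to adding a constant to every score,
# so both sets of attention weights are identical.
w_gauss = softmax(gauss_scores)
w_no_q = softmax(scores_no_q)

# Assumption: keys rescaled to unit norm, so -1/2 * ||k_i||^2 is
# the same constant for every key and also cancels in the softmax.
unit_keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
w_gauss_unit = softmax(-0.5 * np.sum((q - unit_keys) ** 2, axis=1))
w_dot_unit = softmax(unit_keys @ q)

print(np.allclose(w_gauss, w_no_q))
print(np.allclose(w_gauss_unit, w_dot_unit))
```

Both checks compare attention weights rather than raw scores, because the dropped terms change the scores themselves and only cancel after normalization.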

Updated 2026-05-14

Tags

D2L

Dive into Deep Learning @ D2L
