In attention pooling, the width of the kernel determines how smooth the estimate is and how strongly it responds to local variations. A narrower kernel confines large attention weights to keys close to the query, producing a less smooth estimate that tracks local variations in the data more closely. Conversely, a wider kernel spreads attention weights over more keys, resulting in a smoother overall estimate.

Kernel Width Effect on Attention Pooling

The Gaussian kernel for attention pooling can be adapted by incorporating a width parameter, denoted by $$\sigma^2$$. The modified formula is $$\alpha(\mathbf{q}, \mathbf{k}) = \exp\left(-\frac{1}{2 \sigma^2} \|\mathbf{q} - \mathbf{k}\|^2 \right)$$. Adjusting this width parameter allows control over how smoothly the attention weights decay with respect to the distance between the query $$\mathbf{q}$$ and the key $$\mathbf{k}$$.
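The effect of the width parameter can be illustrated with a minimal NumPy sketch. It assumes scalar queries and keys and a synthetic noisy dataset; the function name `gaussian_attention_pool`, the data, and the chosen values of `sigma` are illustrative assumptions, not part of the source.

```python
import numpy as np

def gaussian_attention_pool(x_query, x_keys, y_values, sigma=1.0):
    """Attention pooling with a Gaussian kernel of width sigma (illustrative sketch)."""
    # Squared distance between every query and every key: shape (n_query, n_keys)
    d2 = (x_query[:, None] - x_keys[None, :]) ** 2
    # Scores are the kernel exponents: -||q - k||^2 / (2 sigma^2)
    scores = -d2 / (2.0 * sigma ** 2)
    # Subtract the row-wise maximum before exponentiating for numerical stability
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    # Normalize over keys so that each query's weights sum to 1
    weights = alpha / alpha.sum(axis=1, keepdims=True)
    # The estimate is the attention-weighted average of the values
    return weights @ y_values, weights

# Synthetic data (an assumption made for this example only)
rng = np.random.default_rng(0)
x_keys = np.sort(rng.uniform(0.0, 5.0, 40))
y_values = 2.0 * np.sin(x_keys) + x_keys + rng.normal(0.0, 0.5, 40)
x_query = np.linspace(0.0, 5.0, 50)

y_narrow, w_narrow = gaussian_attention_pool(x_query, x_keys, y_values, sigma=0.2)
y_wide, w_wide = gaussian_attention_pool(x_query, x_keys, y_values, sigma=2.0)

# Average of the largest weight per query: higher means more concentrated attention
print("mean max weight, narrow kernel:", w_narrow.max(axis=1).mean())
print("mean max weight, wide kernel:  ", w_wide.max(axis=1).mean())
```

With the narrow kernel, each query's weight mass sits on a few nearby keys. With the wide kernel, the weights are spread across many keys, which averages out more of the noise in the values.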

Claude

The Gaussian kernel for attention pooling is defined by the formula $$\alpha(x) = \exp(-x^2 / 2)$$, where $$x$$ stands for the distance between the query and the key. It is a translation- and rotation-invariant kernel that assigns smoothly decaying weights to observations as that distance grows.

Gaussian Attention Kernel

Dive into Deep Learning

Gaussian Attention Kernel with Variable Width

The attention scoring function (without exponentiation) derived from the Gaussian kernel is mathematically expressed as $$a(\mathbf{q}, \mathbf{k}_i) = -\frac{1}{2} \|\mathbf{q} - \mathbf{k}_i\|^2 = \mathbf{q}^\top \mathbf{k}_i -\frac{1}{2} \|\mathbf{k}_i\|^2 -\frac{1}{2} \|\mathbf{q}\|^2$$. Because the final term ($$-\frac{1}{2} \|\mathbf{q}\|^2$$) depends exclusively on the query and is identical for all keys paired with that query, it cancels completely when the attention weights are normalized to sum to $$1$$ (e.g., via the softmax operation). Furthermore, when key vectors are generated using techniques like batch or layer normalization, their norms ($$\|\mathbf{k}_i\|$$) become well-bounded and essentially constant, allowing the key-dependent term to be safely dropped without a major change in the outcome. By eliminating both of these norm terms, the Gaussian kernel conceptually simplifies into the standard dot product attention scoring function: $$a(\mathbf{q}, \mathbf{k}_i) = \mathbf{q}^\top \mathbf{k}_i$$.
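The two simplifications above can be checked numerically. The following is a small sketch with random vectors, not a proof. The unit-norm keys in the second half are an idealization of the "essentially constant" norms mentioned above; normalization layers only make key norms approximately constant, so in practice the match would be approximate rather than exact.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over a 1-D score vector
    shifted = scores - scores.max()
    e = np.exp(shifted)
    return e / e.sum()

rng = np.random.default_rng(0)
q = rng.normal(size=4)        # one query vector
K = rng.normal(size=(6, 4))   # six key vectors, one per row

# Full Gaussian score: -1/2 ||q - k_i||^2
w_gauss = softmax(-0.5 * np.sum((q - K) ** 2, axis=1))

# Same score with the query-only term -1/2 ||q||^2 dropped
w_no_q = softmax(K @ q - 0.5 * np.sum(K ** 2, axis=1))

# Softmax is invariant to adding a constant to all scores, so these agree
print("max difference after dropping the query term:", np.abs(w_gauss - w_no_q).max())

# Idealized case: all keys rescaled to unit norm, so -1/2 ||k_i||^2 is constant too
K_unit = K / np.linalg.norm(K, axis=1, keepdims=True)
w_gauss_unit = softmax(-0.5 * np.sum((q - K_unit) ** 2, axis=1))
w_dot_unit = softmax(K_unit @ q)

print("max difference with unit-norm keys:", np.abs(w_gauss_unit - w_dot_unit).max())
```

Both differences are at the level of floating-point rounding, because each dropped term is the same for every key and therefore shifts all scores by a common constant.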
