To implement translation and rotation invariant attention kernels in practice, one function is defined for each of the Gaussian, boxcar, constant, and Epanechnikov kernels. Each function takes a scalar distance as input and returns the corresponding attention weight ($$\alpha$$). The four kernels encode different notions of range and smoothness for attention pooling.

Claude

Attention kernels $$\alpha(\mathbf{k}, \mathbf{q})$$ that are translation and rotation invariant keep the same value when the key $$\mathbf{k}$$ and query $$\mathbf{q}$$ are shifted and rotated in the same manner. For the kernels considered here, this means the value depends only on the distance between key and query. For simplicity, take scalar arguments $$k, q \in \mathbb{R}$$ and pick the key $$k = 0$$ as the origin. The kernel then becomes a function of the query alone, and different choices of that function correspond to different notions of range and smoothness.
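The invariance property can be checked numerically. The following is a minimal sketch, assuming 2-D keys and queries and a Gaussian kernel of the Euclidean distance; the helper names `gaussian_kernel` and `rigid_motion`, and the sample points, angle, and shift, are hypothetical choices made for illustration only.

```python
import math

def gaussian_kernel(k, q):
    """Gaussian weight computed from the Euclidean distance between k and q."""
    dist = math.dist(k, q)
    return math.exp(-dist ** 2 / 2)

def rigid_motion(p, angle, shift):
    """Rotate the 2-D point p by `angle` radians, then translate it by `shift`."""
    x, y = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1])

# Arbitrary example points (assumed values, not from the source).
key, query = (0.5, -1.0), (1.5, 0.25)
angle, shift = 0.7, (3.0, -2.0)

before = gaussian_kernel(key, query)
after = gaussian_kernel(rigid_motion(key, angle, shift),
                        rigid_motion(query, angle, shift))

# The two weights should agree up to floating-point rounding.
print(before, after, math.isclose(before, after))
```

Because a rotation followed by a shift preserves the distance between the two points, any kernel written as a function of that distance passes the same check.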

Translation and Rotation Invariant Attention Kernels

Dive into Deep Learning

The Gaussian kernel for attention pooling is defined by the formula $$\alpha(x) = \exp(-x^2 / 2)$$. It is a translation and rotation invariant kernel that assigns smoothly decaying weights to observations based on their distance from the origin.
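As a minimal sketch of the formula above, assuming a plain Python scalar as input (the function name `gaussian` is an illustrative choice), the kernel can be written and evaluated at a few distances:

```python
import math

def gaussian(x):
    """Gaussian attention weight for a scalar distance x: exp(-x^2 / 2)."""
    return math.exp(-x ** 2 / 2)

# Weights decay smoothly and never reach exactly zero.
for d in (0.0, 1.0, 2.0, 3.0):
    print(f"distance {d:.1f} -> weight {gaussian(d):.4f}")
# distance 0.0 -> weight 1.0000
# distance 1.0 -> weight 0.6065
# distance 2.0 -> weight 0.1353
# distance 3.0 -> weight 0.0111
```

The weight depends only on the magnitude of `x`, so `gaussian(-2.0)` and `gaussian(2.0)` are equal.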

Gaussian Attention Kernel

The boxcar kernel is an attention pooling kernel defined by $$\alpha(x) = 1$$ if $$|x| < 1$$ and $$\alpha(x) = 0$$ otherwise. It acts as an indicator function: it attends equally to every observation within a distance of 1 (or another threshold chosen as a hyperparameter) and assigns a weight of zero to all other observations.
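A minimal sketch of this kernel, assuming a scalar input, might look as follows. The `width` parameter is a hypothetical name for the threshold hyperparameter mentioned in the text; its default of 1.0 reproduces the formula as stated.

```python
def boxcar(x, width=1.0):
    """Indicator weight: 1.0 strictly inside the window |x| < width, else 0.0."""
    return 1.0 if abs(x) < width else 0.0

print(boxcar(0.5))             # inside the window
print(boxcar(1.0))             # on the boundary, excluded by the strict inequality
print(boxcar(1.5, width=2.0))  # inside a wider, assumed window
```

Unlike a smoothly decaying kernel, the weight jumps from 1 to 0 at the boundary, so an observation either counts fully or not at all.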

Boxcar Attention Kernel

The constant kernel for attention pooling is defined by the formula $$\alpha(x) = 1.0 + 0 \times x$$, where the $$0 \times x$$ term is identically zero. It is a translation and rotation invariant kernel that assigns a uniform weight of 1.0 to all observations, regardless of their distance.
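To illustrate what a uniform weight implies, here is a minimal sketch that plugs the constant kernel into a simple pooling step. It assumes the weights are normalized to sum to one before averaging the values; the helper `attention_pool` and the sample data are hypothetical and serve only as an example.

```python
def constant(x):
    """Constant attention weight; the 0 * x term mirrors the formula and adds nothing."""
    return 1.0 + 0 * x

def attention_pool(query, keys, values, kernel):
    """Weighted average of values, with weights kernel(query - key) normalized to sum to 1."""
    weights = [kernel(query - k) for k in keys]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

# Assumed toy data.
keys = [0.0, 1.0, 2.0, 3.0]
values = [1.0, 3.0, 2.0, 6.0]

pooled = attention_pool(1.5, keys, values, constant)
print(pooled)  # 3.0, the plain mean of the values
```

Under this normalization the constant kernel ignores the query entirely, so the pooled output is the same average for every query.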

Constant Attention Kernel

The Epanechnikov kernel for attention pooling is defined by the formula $$\alpha(x) = \max(1 - |x|, 0)$$. It is a translation and rotation invariant kernel whose weight decays linearly as the distance $$|x|$$ increases up to 1 and is exactly zero for observations beyond a distance of 1. The kernel therefore has a bounded range, with a gradual falloff inside that range instead of an abrupt cutoff.
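A minimal sketch of this kernel, assuming a scalar input, is given below. It follows the formula exactly as stated in the text, which decays linearly; the kernel usually given under this name in the statistics literature is quadratic, proportional to $$1 - x^2$$ on $$|x| \le 1$$. The function name `epanechnikov` is an illustrative choice.

```python
def epanechnikov(x):
    """Weight max(1 - |x|, 0): linear decay inside |x| < 1, exactly zero outside."""
    return max(1.0 - abs(x), 0.0)

for d in (0.0, 0.25, 1.0, 2.0):
    print(f"distance {d:.2f} -> weight {epanechnikov(d):.2f}")
# distance 0.00 -> weight 1.00
# distance 0.25 -> weight 0.75
# distance 1.00 -> weight 0.00
# distance 2.00 -> weight 0.00
```

Observations at a distance of 1 or more contribute nothing, while those inside the window are weighted by how close they are to the origin.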
