When applied to Nadaraya-Watson regression, nontrivial attention kernels such as the Gaussian, Boxcar, and Epanechikov kernels produce very similar, workable estimates that are not too far from the true underlying function. This similarity arises because, despite having different mathematical functional forms, these kernels yield very comparable attention weights after normalization, causing them to pool the data values in a roughly equivalent manner.

Claude

The Nadaraya-Watson estimator provides intuition for how attention mechanisms work in a regression setting. In this context, the query corresponds to the target location where the regression is being carried out. The keys represent the locations where past data was previously observed, and the values represent the actual observed regression values themselves. This framework demonstrates how an attention mechanism can aggregate data by pooling these (key, value) pairs based on a given query.

Nadaraya-Watson Estimator

Dive into Deep Learning

Attention kernels $$\alpha(\mathbf{k}, \mathbf{q})$$ that are translation and rotation invariant remain unchanged in value if the key $$\mathbf{k}$$ and query $$\mathbf{q}$$ are shifted and rotated in the same manner. For simplicity, when scalar arguments $$k, q \in \mathbb{R}$$ are chosen and the key $$k = 0$$ is picked as the origin, the kernel can be expressed as a function of the query, yielding various scalar kernel functions that correspond to different notions of range and smoothness.

Translation and Rotation Invariant Attention Kernels

To observe Nadaraya-Watson estimation in action, a synthetic training dataset can be generated using a specific nonlinear dependency. A common dependency used for this purpose is $$y_i = 2\sin(x_i) + x_i + \epsilon$$, where the noise term $$\epsilon$$ is drawn from a standard normal distribution with zero mean and unit variance. For example, a dataset might consist of $$40$$ training examples sampled from this distribution.

Nadaraya-Watson Estimation Synthetic Data

To compute Nadaraya-Watson kernel regression estimates, a programmatic function can be defined that first calculates the distance between all training features (keys) and validation features (queries). These distances are passed through a chosen kernel function to yield a matrix, which is subsequently normalized across the keys for each query. The resulting normalized relative kernel weights are the attention weights, and when they are multiplied by the training labels (values), the function returns the final regression estimates.

Nadaraya-Watson Estimator Code

Similarity of Nontrivial Attention Kernels

Nadaraya-Watson kernel regression serves as an early precursor to modern attention mechanisms. It can be applied directly to regression or classification tasks with little to no prior training or hyperparameter tuning. In this framework, the attention weight is assigned based on the similarity (or distance) between a query and a key, as well as the availability of similar observations in the dataset.

Learn Before

Related