When using a constant attention kernel in Nadaraya-Watson regression, the resulting estimate is simply the global average of all training labels, given by the trivial formula $$f(x) = \frac{1}{n} \sum_i y_i$$. Because it assigns equal weight to every data point regardless of its distance from the query, this trivial estimate produces an unrealistic result that fails to capture local variations in the underlying data.

Claude

The constant kernel for attention pooling is defined by the formula $$\alpha(x) = 1.0 + 0 \times x$$. It is a translation and rotation invariant kernel that assigns a uniform, constant weight of 1.0 to all observations, regardless of their distance.

Learn Before

Related