Concept

Heuristics for Attention Kernel Width

A uniform kernel width may not be ideal for every data point, and several heuristics adapt the width to the data instead. For example, Silverman (1986) proposed a heuristic that adjusts the kernel width based on the local data density. More recently, Norelli et al. (2022) applied a related nearest-neighbor interpolation technique to design cross-modal representations for images and text, which suggests that such density-dependent adaptations remain relevant.
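To make the idea of a density-dependent width concrete, here is a minimal NumPy sketch of kernel regression in which each training point gets its own Gaussian width. It assumes a simple illustrative rule: the width of a point is its distance to its k-th nearest neighbor, so widths shrink where data is dense and grow where it is sparse. This rule stands in for the general idea and is not Silverman's exact formulation. The function names `adaptive_widths` and `nadaraya_watson`, the parameter `k`, and the toy data are all hypothetical.

```python
import numpy as np

def adaptive_widths(x_train, k=2, eps=1e-8):
    """Per-point width: distance to the k-th nearest neighbor (assumed heuristic)."""
    # Pairwise absolute distances between 1-D training points.
    dists = np.abs(x_train[:, None] - x_train[None, :])
    # After sorting each row, column 0 is the zero distance to the point itself.
    dists = np.sort(dists, axis=1)
    # Clamp to eps so duplicated points do not produce a zero width.
    return np.maximum(dists[:, k], eps)

def nadaraya_watson(x_query, x_train, y_train, widths):
    """Gaussian-kernel attention pooling with one width per training point."""
    diff = x_query[:, None] - x_train[None, :]
    logits = -0.5 * (diff / widths[None, :]) ** 2
    # Subtract the row maximum before exponentiating for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ y_train, weights

# Toy data: dense near 0, sparse further out.
x_train = np.array([0.0, 0.1, 0.2, 0.3, 2.0, 4.0])
y_train = np.sin(x_train)
widths = adaptive_widths(x_train, k=2)
y_hat, attn = nadaraya_watson(np.linspace(0.0, 4.0, 9), x_train, y_train, widths)
```

With this choice, the isolated points at 2.0 and 4.0 receive much larger widths than the clustered points near 0, so they influence a broader range of queries. Whether the kernel should also be rescaled by the width before normalizing is a separate design decision that this sketch leaves out.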


Updated 2026-05-14


Tags

D2L

Dive into Deep Learning @ D2L