Heuristics for Attention Kernel Width
A single, uniform kernel width is not necessarily ideal for every data point, so various heuristics adapt the width dynamically. For example, Silverman (1986) proposed a heuristic that adjusts the kernel width according to the local data density, using narrower kernels where data are dense and wider kernels where they are sparse. More recently, Norelli et al. (2022) used a similar nearest-neighbor interpolation technique to design cross-modal representations for images and text, which shows that such density-dependent adaptations remain relevant.
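To make the idea of a density-dependent width concrete, here is a minimal sketch. It assumes one-dimensional inputs, a Gaussian kernel, and a simple rule in which each query's width is its distance to the k-th nearest key. This rule is only an illustration of adapting the width to local density; it is not Silverman's exact heuristic. The function names `knn_widths` and `adaptive_attention` and the parameter `k` are hypothetical, chosen for this example.

```python
import numpy as np

def knn_widths(queries, keys, k=2, eps=1e-6):
    """Per-query width: distance to the k-th nearest key (illustrative rule)."""
    dist = np.abs(queries[:, None] - keys[None, :])   # shape (m, n)
    kth = np.sort(dist, axis=1)[:, k - 1]             # k-th smallest distance per query
    return np.maximum(kth, eps)                       # floor avoids division by zero

def adaptive_attention(queries, keys, widths):
    """Gaussian attention weights with one width per query; rows sum to 1."""
    diff = queries[:, None] - keys[None, :]           # shape (m, n)
    logits = -0.5 * (diff / widths[:, None]) ** 2
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Keys are dense near 0 and sparse for larger values.
keys = np.array([0.0, 0.1, 0.2, 0.3, 5.0, 9.0])
values = np.sin(keys)
queries = np.array([0.15, 7.0])

widths = knn_widths(queries, keys, k=2)
weights = adaptive_attention(queries, keys, widths)
predictions = weights @ values                        # kernel-regression style estimate
```

In this sketch the query at 0.15 sits in a dense region and gets a small width, so it attends only to its closest neighbors, while the query at 7.0 sits in a sparse region and gets a larger width so that it still receives meaningful weight from the distant keys. Choosing `k` trades off smoothness against locality.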
Updated 2026-05-14
Tags
D2L
Dive into Deep Learning @ D2L