Formula
Gaussian Attention Kernel with Variable Width
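The Gaussian kernel for attention pooling can be adapted by incorporating a width parameter, denoted by $\sigma$. The modified formula is $\alpha(\mathbf{q}, \mathbf{k}) = \exp\left(-\frac{1}{2 \sigma^2} \|\mathbf{q} - \mathbf{k}\|^2\right)$. Adjusting this width parameter controls how quickly the attention weights decay with the distance between the query $\mathbf{q}$ and the key $\mathbf{k}$: a small $\sigma$ gives a narrow kernel that concentrates the weight on keys close to the query, while a large $\sigma$ gives a broad kernel that spreads the weight more evenly over distant keys.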
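The effect of the width can be checked numerically. The following is a minimal NumPy sketch, not code from the source: the function name `gaussian_attention_weights`, the example keys, and the two width values are illustrative assumptions. It also assumes the kernel values are normalized over the keys so that the weights sum to 1, as is usual in attention pooling; the formula above gives only the unnormalized kernel.

```python
import numpy as np

def gaussian_attention_weights(q, keys, sigma=1.0):
    """Gaussian-kernel attention weights of one query over a set of keys.

    Assumption: weights are normalized over the keys so they sum to 1.
    """
    q = np.asarray(q, dtype=float)
    keys = np.asarray(keys, dtype=float)
    if keys.ndim == 1:
        keys = keys[:, None]  # treat a 1-D array as n scalar keys
    # Squared Euclidean distance ||q - k||^2 for every key.
    sq_dist = np.sum((keys - q) ** 2, axis=-1)
    # Exponent of the kernel: -||q - k||^2 / (2 * sigma^2).
    logits = -sq_dist / (2.0 * sigma ** 2)
    # Subtracting the maximum leaves the normalized weights unchanged
    # but avoids underflow to all zeros when sigma is very small.
    logits = logits - logits.max()
    scores = np.exp(logits)
    return scores / scores.sum()

# Hypothetical example: five scalar keys, query at 0.
keys = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
narrow = gaussian_attention_weights(0.0, keys, sigma=0.5)
wide = gaussian_attention_weights(0.0, keys, sigma=2.0)
print("sigma=0.5:", np.round(narrow, 3))
print("sigma=2.0:", np.round(wide, 3))
```

With the smaller width, most of the weight falls on the key that coincides with the query; with the larger width, the weights are closer to uniform.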
Updated 2026-05-14
Tags
D2L
Dive into Deep Learning @ D2L