Formula

Gaussian Attention Kernel with Variable Width

The Gaussian kernel for attention pooling can be adapted by incorporating a width parameter, denoted by $\sigma^2$. The modified formula is $\alpha(\mathbf{q}, \mathbf{k}) = \exp\left(-\frac{1}{2 \sigma^2} \|\mathbf{q} - \mathbf{k}\|^2\right)$. Adjusting this width parameter controls how quickly the attention weights decay with the distance between the query $\mathbf{q}$ and the key $\mathbf{k}$: a smaller $\sigma$ gives a narrower kernel that concentrates weight on the nearest keys, while a larger $\sigma$ gives a broader kernel that spreads weight over more distant keys.
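
The effect of the width can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the function names `gaussian_kernel` and `attention_weights` are hypothetical, and normalizing the kernel values over the keys so that each query's weights sum to one is an assumption about how the kernel is used for attention pooling.

```python
import numpy as np


def gaussian_kernel(queries, keys, sigma=1.0):
    """Unnormalized kernel alpha(q, k) = exp(-||q - k||^2 / (2 sigma^2)).

    queries: array of shape (n_queries, d)
    keys:    array of shape (n_keys, d)
    Returns an array of shape (n_queries, n_keys).
    """
    # Pairwise differences via broadcasting: (n_queries, n_keys, d)
    diff = queries[:, None, :] - keys[None, :, :]
    # Squared Euclidean distance between every query and every key
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def attention_weights(queries, keys, sigma=1.0):
    """Kernel values normalized over the keys (assumed pooling convention)."""
    scores = gaussian_kernel(queries, keys, sigma)
    return scores / scores.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    # One scalar query at 0 and three scalar keys at increasing distance
    q = np.array([[0.0]])
    k = np.array([[0.0], [1.0], [2.0]])
    for sigma in (0.5, 1.0, 2.0):
        print(sigma, attention_weights(q, k, sigma))
```

Running the loop with several values of `sigma` shows the weight on the nearest key shrinking as `sigma` grows. Note that for a very small `sigma` and distant keys, all kernel values can underflow to zero, so a practical version would normalize in log space.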

Updated 2026-05-14

Tags

D2L

Dive into Deep Learning @ D2L