Gated Combination of Local and k-NN Attention

A learned gating mechanism integrates the outputs of parallel attention computations over local and $k$-NN memories. It uses a gating vector $\mathbf{g} \in \mathbb{R}^d$ to dynamically weigh the contributions of the local and $k$-NN attention outputs. This gating vector, also called a coefficient vector, is typically the output of a learned gate function. The final combined attention output is a linear combination controlled by the gate, defined by the following equations:

$$\mathrm{Att}(\mathbf{q}_i, \mathrm{Mem}, \mathrm{Mem}_{k\mathrm{nn}}) = \mathbf{g} \odot \mathrm{Att}_{\mathrm{local}} + (1 - \mathbf{g}) \odot \mathrm{Att}_{k\mathrm{nn}}$$

where the local and $k$-NN attention components are defined as:

$$\mathrm{Att}_{\mathrm{local}} = \mathrm{Att}(\mathbf{q}_i, \mathrm{Mem})$$

$$\mathrm{Att}_{k\mathrm{nn}} = \mathrm{Att}(\mathbf{q}_i, \mathrm{Mem}_{k\mathrm{nn}})$$

Here, $\odot$ denotes element-wise (Hadamard) multiplication. The gate allows the model to decide, for each query $\mathbf{q}_i$, how much to rely on the immediate local context ($\mathrm{Mem}$) versus the long-range retrieved context ($\mathrm{Mem}_{k\mathrm{nn}}$).
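
As a concrete illustration, here is a minimal PyTorch sketch of the combination step. It assumes the gate is produced by a sigmoid over a learned per-dimension parameter; the text leaves the gate function unspecified (it only says $\mathbf{g}$ is the output of a learned gate function), and the function and variable names below are illustrative.

```python
import torch

def gated_attention(att_local: torch.Tensor,
                    att_knn: torch.Tensor,
                    gate_logits: torch.Tensor) -> torch.Tensor:
    """Combine local and k-NN attention outputs with a learned gate.

    att_local:   Att(q_i, Mem),      shape (num_queries, d)
    att_knn:     Att(q_i, Mem_knn),  shape (num_queries, d)
    gate_logits: learned parameters, shape (d,), broadcast over queries
    """
    # Sigmoid squashes the learned logits into (0, 1), so g acts as a
    # convex, per-dimension mixing coefficient. (The sigmoid gate is an
    # assumption; the text only says g comes from a learned gate function.)
    g = torch.sigmoid(gate_logits)
    # Element-wise combination: g ⊙ Att_local + (1 - g) ⊙ Att_knn
    return g * att_local + (1.0 - g) * att_knn

# Example: 8 queries, model dimension d = 64.
d = 64
att_local = torch.randn(8, d)
att_knn = torch.randn(8, d)
gate_logits = torch.nn.Parameter(torch.zeros(d))  # g starts at 0.5
output = gated_attention(att_local, att_knn, gate_logits)
print(output.shape)  # torch.Size([8, 64])
```

Because $\mathbf{g}$ is a vector rather than a scalar, each of the $d$ output dimensions can learn its own balance between local and retrieved context.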

