Gated Combination of Local and k-NN Attention

A learned gating mechanism integrates the outputs of parallel attention computations over local and $k$-NN memories. It uses a gating vector $\mathbf{g} \in \mathbb{R}^d$ to dynamically weigh the contributions of the local and $k$-NN attention outputs. This gating vector, also called a coefficient vector, is typically the output of a learned gate function. The final combined attention output is a linear combination controlled by the gate, defined by the following equations:

$$\mathrm{Att}(\mathbf{q}_i, \mathrm{Mem}, \mathrm{Mem}_{k\mathrm{nn}}) = \mathbf{g} \odot \mathrm{Att}_{\mathrm{local}} + (1 - \mathbf{g}) \odot \mathrm{Att}_{k\mathrm{nn}}$$

where the local and $k$-NN attention components are defined as:

$$\mathrm{Att}_{\mathrm{local}} = \mathrm{Att}(\mathbf{q}_i, \mathrm{Mem})$$

$$\mathrm{Att}_{k\mathrm{nn}} = \mathrm{Att}(\mathbf{q}_i, \mathrm{Mem}_{k\mathrm{nn}})$$

Here, $\odot$ denotes element-wise (Hadamard) multiplication. The gate allows the model to decide, for each query $\mathbf{q}_i$, how much to rely on the immediate local context ($\mathrm{Mem}$) versus the long-range retrieved context ($\mathrm{Mem}_{k\mathrm{nn}}$).
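
As a concrete illustration, here is a minimal PyTorch sketch of the combination step. It assumes the gate is produced by a sigmoid over a learned per-dimension parameter; the text leaves the gate function unspecified (it only says $\mathbf{g}$ is the output of a learned gate function), and the function and variable names below are illustrative.

```python
import torch

def gated_attention(att_local: torch.Tensor,
                    att_knn: torch.Tensor,
                    gate_logits: torch.Tensor) -> torch.Tensor:
    """Combine local and k-NN attention outputs with a learned gate.

    att_local:   Att(q_i, Mem),      shape (num_queries, d)
    att_knn:     Att(q_i, Mem_knn),  shape (num_queries, d)
    gate_logits: learned parameters, shape (d,), broadcast over queries
    """
    # Sigmoid squashes the learned logits into (0, 1), so g acts as a
    # convex, per-dimension mixing coefficient. (The sigmoid gate is an
    # assumption; the text only says g comes from a learned gate function.)
    g = torch.sigmoid(gate_logits)
    # Element-wise combination: g ⊙ Att_local + (1 - g) ⊙ Att_knn
    return g * att_local + (1.0 - g) * att_knn

# Example: 8 queries, model dimension d = 64.
d = 64
att_local = torch.randn(8, d)
att_knn = torch.randn(8, d)
gate_logits = torch.nn.Parameter(torch.zeros(d))  # g starts at 0.5
output = gated_attention(att_local, att_knn, gate_logits)
print(output.shape)  # torch.Size([8, 64])
```

Because $\mathbf{g}$ is a vector rather than a scalar, each of the $d$ output dimensions can learn its own balance between local and retrieved context.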

