Learn Before
Gated Combination of Local and k-NN Attention
A learned gating mechanism is a method for integrating the outputs from parallel attention computations over local and k-NN memories. This approach uses a gating vector, g, to dynamically weigh the contributions of the local and k-NN attention outputs. This gating vector, also known as a coefficient vector, is typically the output of a learned gate function. The final combined attention output is calculated through a linear combination controlled by the gate. The process is defined by the following equation:

o = g ⊙ o_local + (1 − g) ⊙ o_knn

where the local and k-NN attention components are defined as:

o_local = Attn(q, K_local, V_local)
o_knn = Attn(q, K_knn, V_knn)

with K_knn and V_knn being the keys and values retrieved from the external memory by a k-NN search for the query q. Here, ⊙ represents element-wise multiplication. This allows the model to decide how much to rely on immediate context (o_local) versus long-term retrieved context (o_knn) for each query q.
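The gated combination above can be sketched in code. This is a minimal illustration, not a production implementation: the function names (`attention`, `gated_attention`), the dot-product k-NN retrieval, and the sigmoid gate `g = σ(W_g q)` are assumptions for the sketch, since the card does not pin down the exact gate function.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector q.
    d = q.shape[-1]
    scores = softmax(K @ q / np.sqrt(d))   # (n,)
    return scores @ V                      # (d_v,)

def gated_attention(q, K_local, V_local, K_mem, V_mem, W_g, k=4):
    # k-NN path: retrieve the k memory entries most similar to q
    # (dot-product similarity stands in for a real k-NN index here).
    idx = np.argsort(K_mem @ q)[-k:]
    o_knn = attention(q, K_mem[idx], V_mem[idx])
    # Local path: attend over the recent-context keys/values.
    o_local = attention(q, K_local, V_local)
    # Assumed gate function: g = sigmoid(W_g q), applied element-wise.
    g = 1.0 / (1.0 + np.exp(-(W_g @ q)))
    # o = g ⊙ o_local + (1 − g) ⊙ o_knn
    return g * o_local + (1.0 - g) * o_knn
```

Because g is a vector, the gate can favor the local path in some output dimensions and the retrieved path in others for the same query.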

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Gated Combination of Local and k-NN Attention
An advanced language model is designed to be a conversational partner while also having access to a vast external knowledge base. When processing a user's query, the model employs a dual-path architecture:
- One path calculates attention over the recent conversational history (the "local context").
- A parallel path performs a similarity search on the external knowledge base to find the most relevant documents and then calculates attention over the content of those documents. The outputs from both paths are then integrated to form the final response.
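The retrieval step in the second path can be sketched as a similarity search over precomputed embeddings. This is a toy sketch under stated assumptions: the function name `knn_retrieve` and the use of raw dot-product similarity are illustrative (a real system would use a vector index such as an approximate-nearest-neighbor structure).

```python
import numpy as np

def knn_retrieve(query, kb_embeddings, k=2):
    # Score every knowledge-base entry by dot-product similarity
    # to the query embedding, then return the indices of the top-k.
    sims = kb_embeddings @ query
    return np.argsort(sims)[::-1][:k]
```

The attention computed over the entries these indices select is what the gating mechanism later blends with the local-context attention output.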
What is the primary architectural advantage of processing local context and retrieved knowledge in two separate, parallel streams?
Architectural Solution for Long-Term Context
A language model architecture is designed to process a query by using two parallel computational streams: one that computes attention over a local memory of recent context, and another that searches an external datastore for relevant information. Match each architectural component with its primary function in this process.
Learn After
A language model architecture combines information from two sources: an 'immediate context' output and a 'retrieved knowledge' output. It uses a learned gating vector, g, to dynamically weigh these sources. The final output is calculated using the formula:

Output = g ⊙ [immediate_context_output] + (1 − g) ⊙ [retrieved_knowledge_output],

where ⊙ is element-wise multiplication. If, during a specific task, the values in the gating vector g are consistently close to 0.0, what does this imply about the model's behavior for that task?
Advantage of a Learned Gating Mechanism
Calculating a Gated Attention Output
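The effect of an extreme gate can be checked numerically. The values below are illustrative, not from the card; the sketch shows that when every entry of g is 0, the combined output reduces exactly to the retrieved-knowledge output, i.e. the model ignores the immediate context for that task.

```python
import numpy as np

# Illustrative outputs (assumed values, 3 dimensions each).
local_out = np.array([1.0, 2.0, 3.0])         # immediate-context attention output
retrieved_out = np.array([10.0, 20.0, 30.0])  # retrieved-knowledge attention output

def combine(g):
    # Output = g ⊙ local + (1 − g) ⊙ retrieved
    return g * local_out + (1.0 - g) * retrieved_out

# Gate values of 0.0 select the retrieved-knowledge output entirely:
print(combine(np.zeros(3)))  # → [10. 20. 30.]
```

Symmetrically, gate values close to 1.0 would make the output match the immediate-context path.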