k-NN Search Augmented Attention
k-NN search augmented attention is an attention mechanism that processes a query vector, qi, through two parallel computational streams. The first stream computes standard attention over a local memory, such as the KV cache, to capture the immediate context. Concurrently, the second stream uses qi to perform a k-NN search over an external datastore, retrieving the k nearest key-value pairs, and a separate attention computation is then performed over these retrieved neighbors. Finally, the outputs of the two streams are integrated into a single result. This integration can occur at different levels: the attention output vectors can be combined directly (e.g., by interpolation), or each stream's output can first be converted into a probability distribution and the two distributions then merged.
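The two-stream computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes a single query vector, uses brute-force Euclidean search in place of an approximate k-NN index, and combines the streams with a fixed interpolation weight `lam` (all of these names are illustrative, not from the original).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # Scaled dot-product attention of one query over a set of keys/values.
    d = q.shape[-1]
    w = softmax(q @ K.T / np.sqrt(d))
    return w @ V

def knn_augmented_attention(q, K_local, V_local, K_store, V_store, k=4, lam=0.5):
    # Stream 1: standard attention over the local memory (e.g., the KV cache).
    local_out = attend(q, K_local, V_local)
    # Stream 2: brute-force k-NN search over the external datastore,
    # then attention over only the k retrieved neighbors.
    dists = np.linalg.norm(K_store - q, axis=1)
    idx = np.argsort(dists)[:k]
    knn_out = attend(q, K_store[idx], V_store[idx])
    # Integrate the two streams at the output-vector level.
    return lam * local_out + (1 - lam) * knn_out
```

In a real system the datastore search would use an approximate-nearest-neighbor index rather than a full scan, and the interpolation weight could be a learned gate instead of a constant; merging at the probability-distribution level would instead combine the two streams after each has produced a distribution over the vocabulary.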
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Combined KV Cache for k-NN and Local Memory
k-NN Search Augmented Attention
Optimizing Attention for a Specialized Chatbot
A team is designing a language model for a legal chatbot. The model must be able to follow the immediate flow of a user's query while also referencing specific, relevant legal precedents from a massive, static database. Which of the following approaches for the model's attention mechanism best addresses this dual requirement?
Diagnosing Memory Deficiencies in a Chatbot
Linear Combination of Local and External Attention
k-NN Search Augmented Attention
In a memory retrieval system, a query is compared against a large datastore of key-value pairs to find the 'k' most similar keys. The corresponding key-value pairs are then returned. What is the primary effect of increasing the value of 'k'?
Optimizing a Chatbot's Retrieval System
A system is designed to retrieve information from a datastore of key-value pairs using a nearest-neighbor approach. Arrange the following steps of this retrieval process in the correct logical sequence.
Learn After
Gated Combination of Local and k-NN Attention
An advanced language model is designed to be a conversational partner while also having access to a vast external knowledge base. When processing a user's query, the model employs a dual-path architecture:
- One path calculates attention over the recent conversational history (the "local context").
- A parallel path performs a similarity search on the external knowledge base to find the most relevant documents and then calculates attention over the content of those documents.

The outputs from both paths are then integrated to form the final response.
What is the primary architectural advantage of processing local context and retrieved knowledge in two separate, parallel streams?
Architectural Solution for Long-Term Context
A language model architecture is designed to process a query by using two parallel computational streams: one that computes attention over a local memory of recent context, and another that searches an external datastore for relevant information. Match each architectural component with its primary function in this process.