Combined KV Cache for k-NN and Local Memory
One straightforward method for integrating retrieved k-NN memory is to concatenate it with the local memory. In this approach, the local memory (Mem_local) and the retrieved k-NN memory (Mem_knn) are combined to form a single, larger key-value cache, Mem = Mem_knn ∪ Mem_local. The model then performs a standard query-key-value attention operation over this unified cache for a given query q.
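The combined-cache idea can be sketched in a few lines of NumPy. This is a minimal single-query, single-head illustration, not the book's implementation: the function names and the assumption that retrieved entries are simply prepended to the local cache are choices made here for clarity.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_over_combined_cache(q, k_local, v_local, k_knn, v_knn):
    """Standard scaled dot-product attention over the union of the
    retrieved k-NN memory and the local KV cache."""
    # Concatenate the two memories into one larger KV cache
    # (here the retrieved entries are placed before the local ones).
    K = np.concatenate([k_knn, k_local], axis=0)  # (n_knn + n_local, d)
    V = np.concatenate([v_knn, v_local], axis=0)  # (n_knn + n_local, d)
    scores = q @ K.T / np.sqrt(K.shape[-1])       # (n_knn + n_local,)
    weights = softmax(scores)                     # attention over both memories
    return weights @ V                            # (d,)
```

Note that attention weights are normalized jointly over both memories, so local and retrieved entries compete directly for attention mass; the trade-off is that the attention cost grows with the size of the retrieved set.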
Tags
Ch.2 Generative Models - Foundations of Large Language Models