General Form of Memory-Based Attention
The attention operation at a given position $i$ that utilizes a memory component to retain contextual information can be expressed in a general form. This operation computes attention using a query vector $\mathbf{q}_i$ and a memory model $\mathrm{Mem}_i$. In standard attention, this memory model is defined as the complete Key-Value (KV) cache up to position $i$, meaning $\mathrm{Mem}_i = \mathrm{KV}[1, i] = \{(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_i, \mathbf{v}_i)\}$. As a result, the size of $\mathrm{Mem}_i$ is determined directly by the sequence length $i$. The general formula is: $\mathbf{o}_i = \mathrm{Attention}(\mathbf{q}_i, \mathrm{Mem}_i)$.
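Below is a minimal NumPy sketch of this general form, not the book's implementation: `attend` stands in for $\mathrm{Attention}(\mathbf{q}_i, \mathrm{Mem}_i)$ as plain scaled dot-product attention, and the fixed-size variant (a sliding window over the last `window` KV pairs, an assumption made here purely for illustration) shows how a bounded $\mathrm{Mem}_i$ keeps the per-step cost constant while the standard KV cache grows with $i$.

```python
import numpy as np

def attend(q, keys, values):
    # Scaled dot-product attention of one query over a memory of
    # key-value pairs: softmax(q K^T / sqrt(d)) V.
    d = q.shape[-1]
    scores = keys @ q / np.sqrt(d)           # one score per memory slot
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values

rng = np.random.default_rng(0)
d, window = 16, 4                            # `window` is illustrative, not from the source
keys, values = [], []

for i in range(1, 9):
    q_i = rng.normal(size=d)
    keys.append(rng.normal(size=d))          # k_i
    values.append(rng.normal(size=d))        # v_i

    # Standard attention: Mem_i = KV[1, i], so the memory holds i pairs.
    out_standard = attend(q_i, np.stack(keys), np.stack(values))

    # Fixed-size memory (one possible choice): keep only the last `window`
    # pairs, so the attention cost per step stays constant.
    out_fixed = attend(q_i, np.stack(keys[-window:]), np.stack(values[-window:]))

    print(f"step {i}: standard memory = {len(keys)} pairs, "
          f"fixed memory = {min(i, window)} pairs")
```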

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Fixed-Size Memory for Constant Attention Cost
Multiple Memory Models in Attention
A language model is tasked with processing an extremely long document. How does an attention mechanism that uses a separate, fixed-size memory component to represent context differ from a standard attention mechanism in managing the information from the beginning of the document as it generates new text?
Managing Context in Long-Sequence Generation
Memory Models vs. Efficient Attention for Cache Optimization
Optimizing a Chatbot for Long Conversations
Notation for Key-Value Pairs
Architectural Strategies for Long-Context Processing
Learn After
A language model generates text token by token. At each step 'i', an attention operation computes an output using a query vector and a memory component. In a standard causal implementation, this memory component is defined as the complete set of key and value vectors from all previous steps (1 to i). Based on this definition, what is the direct relationship between the size of this memory component and the length of the generated sequence 'i'?
Sparse Attention with a Fixed Key-Value Subset
Evaluating Memory Models in Attention Mechanisms
Evaluating an Attention Mechanism for a Real-Time Application