Learn Before
Summary Vectors for Memory Compression in Attention
An alternative to using a sliding window for the memory component (Mem) is to define it as a pair of summary vectors. Rather than storing a subset of the raw key-value pairs, this approach condenses the sequence's entire history into a fixed-size representation, so the memory cost stays constant no matter how long the input grows.
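As a minimal sketch, one way to build such summary vectors is a cumulative average of all preceding keys and values (one of the variants named under Learn After). The dimension `d`, the function name `update_summary`, and the random data below are illustrative assumptions, not part of the original text:

```python
import numpy as np

d = 8  # head dimension (illustrative choice)

# Summary memory: a single condensed key and value for the whole history.
k_bar = np.zeros(d)
v_bar = np.zeros(d)
t = 0  # number of positions folded into the summary so far


def update_summary(k_t, v_t):
    """Fold a new key/value pair into the fixed-size summary vectors
    via a running (cumulative) average: x_bar += (x_t - x_bar) / t."""
    global k_bar, v_bar, t
    t += 1
    k_bar += (k_t - k_bar) / t
    v_bar += (v_t - v_bar) / t


# Stream 100 key/value pairs through the memory.
rng = np.random.default_rng(0)
keys = rng.normal(size=(100, d))
values = rng.normal(size=(100, d))
for k_t, v_t in zip(keys, values):
    update_summary(k_t, v_t)

# The summary equals the mean of everything seen so far, yet the memory
# itself remains two d-dimensional vectors regardless of sequence length.
assert np.allclose(k_bar, keys.mean(axis=0))
assert np.allclose(v_bar, values.mean(axis=0))
```

Unlike a 256-position window, the per-step cost of reading this memory does not depend on how much history has been summarized, which is the trade-off the question below asks about.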
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Fixed-Size Window Memory as a Form of Local Attention
Summary Vectors for Memory Compression in Attention
General Recurrent Formula for Memory Update
Comparison of Memory Storage in Window-based and Moving Average Caches
Hybrid Cache for Attention Mechanisms
An attention mechanism is designed to use a memory component that has a constant, fixed size, regardless of how long the input sequence becomes. What is the primary computational consequence of this design choice as the input sequence length increases significantly?
Computational Cost Scaling in Attention Mechanisms
Optimizing a Real-Time Sequence Processing Model
Learn After
Moving Average of Keys and Values for Memory Component
Weighted Moving Average for Memory Component
Cumulative Average of Keys and Values for Memory Component
An engineer is designing a language model that must process very long sequences while keeping the computational cost of attention constant at each step. They are considering two approaches for the model's memory component:
- Approach 1: The memory stores the raw key-value pairs from the 256 most recent positions in the sequence.
- Approach 2: The memory is a pair of fixed-size 'summary' vectors, which are calculated by mathematically combining all preceding key-value pairs into a single, condensed representation.
Which statement best analyzes the primary trade-off between these two approaches?
Memory Representation in Attention Mechanisms
Recurrent Update for Memory Caching
Optimizing Memory for Long-Sequence Processing