Memory Representation in Attention Mechanisms
A language model is designed to process extremely long documents. To keep computational costs bounded, its attention mechanism uses a fixed-size memory component. One implementation stores the raw key-value pairs from the last 100 tokens. An alternative implementation maintains a pair of 'summary vectors' that combine information from all preceding tokens into a single fixed-size representation. Compare these two approaches in terms of the kind of historical information each one preserves and the kind each one might lose.
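For concreteness, a minimal Python sketch of the two memory schemes follows. The dimensionality and the cumulative-average combination rule are illustrative assumptions; the question itself does not specify how the summary vectors are computed.

```python
from collections import deque

import numpy as np

WINDOW = 100  # Approach 1's window: raw K/V for the last 100 tokens (from the question)
D = 64        # key/value dimensionality; an assumed value for illustration

class SlidingWindowMemory:
    """Approach 1: exact keys/values, but only for the most recent WINDOW tokens."""
    def __init__(self):
        # deque(maxlen=...) silently evicts the oldest entry once full
        self.keys = deque(maxlen=WINDOW)
        self.values = deque(maxlen=WINDOW)

    def update(self, k, v):
        # Recent history is preserved verbatim; a token 101 steps back is lost entirely.
        self.keys.append(k)
        self.values.append(v)

class SummaryVectorMemory:
    """Approach 2: the whole history compressed into one fixed-size pair of vectors.
    The running (cumulative) average below is one possible combination rule,
    assumed here because the question leaves the rule unspecified."""
    def __init__(self):
        self.k_summary = np.zeros(D)
        self.v_summary = np.zeros(D)
        self.count = 0

    def update(self, k, v):
        # Every token ever seen leaves a trace in the summary, but no individual
        # token can be recovered exactly from the blended vectors.
        self.count += 1
        self.k_summary += (k - self.k_summary) / self.count
        self.v_summary += (v - self.v_summary) / self.count

# After 1,000 tokens: mem1 holds the 100 most recent exact pairs;
# mem2 holds two vectors that faintly reflect all 1,000 tokens.
mem1, mem2 = SlidingWindowMemory(), SummaryVectorMemory()
for _ in range(1000):
    k, v = np.random.randn(D), np.random.randn(D)
    mem1.update(k, v)
    mem2.update(k, v)
```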
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Moving Average of Keys and Values for Memory Component
Weighted Moving Average for Memory Component
Cumulative Average of Keys and Values for Memory Component
An engineer is designing a language model that must process very long sequences while keeping the computational cost of attention constant at each step. They are considering two approaches for the model's memory component:
- Approach 1: The memory stores the raw key-value pairs from the 256 most recent positions in the sequence.
- Approach 2: The memory is a pair of fixed-size 'summary' vectors, computed by combining all preceding key-value pairs into a single condensed representation.
Which statement best analyzes the primary trade-off between these two approaches?
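For illustration, below is a minimal sketch of Approach 2 realized as a recurrent, exponentially weighted update. This is one possible realization; the decay constant, dimensionality, and update rule are assumptions, not part of the question. It makes concrete why the per-step cost stays constant:

```python
import numpy as np

D = 64  # key/value dimensionality; an assumed value for illustration

def update_summary(k_summary, v_summary, k_t, v_t, decay=0.99):
    """One recurrent step: fold the newest key/value into the fixed-size summary.
    The exponential decay (a hypothetical choice) weights recent tokens more
    heavily than old ones; cost is O(1) per token regardless of sequence length."""
    k_summary = decay * k_summary + (1.0 - decay) * k_t
    v_summary = decay * v_summary + (1.0 - decay) * v_t
    return k_summary, v_summary

# Memory stays two D-dimensional vectors no matter how long the sequence grows;
# Approach 1 is also constant-cost per step, but its constant is 256 raw pairs.
k_s, v_s = np.zeros(D), np.zeros(D)
for _ in range(10_000):
    k_t, v_t = np.random.randn(D), np.random.randn(D)
    k_s, v_s = update_summary(k_s, v_s, k_t, v_t)
```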
Recurrent Update for Memory Caching
Optimizing Memory for Long-Sequence Processing