Learn Before
Computational Cost Scaling in Attention Mechanisms
Consider two language models processing a very long sequence of text one token at a time. Model A uses an attention mechanism in which the memory component it attends to has a constant, predetermined size. Model B uses a standard attention mechanism in which the memory component grows to include every previous token. Compare how, for Model A versus Model B, the computational cost of calculating attention for each new token changes as the sequence gets longer. Explain the fundamental reason for this difference.
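To make the contrast concrete, here is a minimal NumPy sketch of the two caching strategies. Every name and size in it (attend, d, window, the oldest-entry eviction rule) is an illustrative assumption, not a detail from the question; the point is only that the work per query is proportional to the number of memory entries attended over.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    # The dominant cost is proportional to the number of rows in K and V,
    # i.e. to the size of the memory being attended over.
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 16                      # hypothetical key/value dimension
window = 8                  # hypothetical fixed memory size for Model A
rng = np.random.default_rng(0)

cache_a = []                # Model A: capped at `window` entries
cache_b = []                # Model B: keeps every previous token

for t in range(1, 33):
    k, v, q = (rng.normal(size=d) for _ in range(3))

    cache_b.append((k, v))           # Model B's memory grows without bound
    cache_a.append((k, v))
    if len(cache_a) > window:
        cache_a.pop(0)               # Model A evicts the oldest entry

    for name, cache in (("A", cache_a), ("B", cache_b)):
        K = np.stack([kv[0] for kv in cache])
        V = np.stack([kv[1] for kv in cache])
        attend(q, K, V)              # per-token work scales with len(cache)
        if t in (8, 32):
            print(f"t={t:2d}  Model {name} attends over {len(cache):2d} entries")
```

Running the sketch shows Model A's memory staying fixed at `window` entries no matter how long the sequence gets, while Model B's memory grows with every token generated.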
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Fixed-Size Window Memory as a Form of Local Attention
Summary Vectors for Memory Compression in Attention
General Recurrent Formula for Memory Update
Comparison of Memory Storage in Window-based and Moving Average Caches
Hybrid Cache for Attention Mechanisms
An attention mechanism is designed to use a memory component that has a constant, fixed size, regardless of how long the input sequence becomes. What is the primary computational consequence of this design choice as the input sequence length increases significantly?
Optimizing a Real-Time Sequence Processing Model