A team is developing a language model designed to process extremely long sequences, but they are constrained by the computational cost of storing and attending to every previous token's key-value pair. They are evaluating two distinct architectural solutions:
- Solution A: Modify the attention mechanism itself so that each token only attends to a strategically chosen subset of previous tokens, rather than all of them.
- Solution B: Introduce a separate, fixed-size data structure that periodically summarizes and compresses the key-value pairs from older tokens into a condensed representation.
Which statement best analyzes the fundamental difference in how these two solutions address the long-sequence problem?
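To make the contrast concrete, here is a minimal sketch (not taken from any particular model; the function names and the mean-pooling compression scheme are illustrative assumptions) of the two approaches: Solution A as a sliding-window attention mask that restricts which keys each query may attend to, and Solution B as a fixed-budget cache that mean-pools older key-value pairs into summary slots.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # Solution A (sparse attention): each query token attends only to the
    # `window` most recent tokens, so per-step attention cost is O(window)
    # instead of O(seq_len). Illustrative; real models use varied patterns.
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for q in range(seq_len):
        lo = max(0, q - window + 1)
        mask[q, lo:q + 1] = True
    return mask

def compress_kv(keys: np.ndarray, values: np.ndarray, budget: int):
    # Solution B (compressive memory): fold the stored key-value pairs into
    # at most `budget` mean-pooled summary slots, keeping the cache a fixed
    # size no matter how long the sequence grows. Mean pooling is just one
    # assumed compression scheme for illustration.
    n = len(keys)
    if n <= budget:
        return keys.copy(), values.copy()
    chunks = np.array_split(np.arange(n), budget)
    k = np.stack([keys[idx].mean(axis=0) for idx in chunks])
    v = np.stack([values[idx].mean(axis=0) for idx in chunks])
    return k, v
```

The sketch highlights the structural difference the question asks about: Solution A changes which entries of the (unmodified) cache are read at attention time, while Solution B changes what is stored, replacing exact older entries with a lossy fixed-size summary.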
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Architectural Trade-offs for Long-Context Summarization
Architectural Choice for a Long-Document Q&A System