Difficulty of Training Transformers on Long Sequences
Training Transformer-based models becomes exceptionally challenging on very long input sequences, because the compute and memory costs of self-attention grow quadratically with sequence length. The problem is especially acute in streaming contexts, where the sequence length keeps growing without bound. This difficulty is a primary motivation for developing alternative memory architectures.
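The quadratic blow-up can be made concrete with a back-of-the-envelope estimate. The sketch below uses a hypothetical `attention_cost` helper with assumed model settings (12 layers, d_model = 768, fp16 activations), none of which come from this card; it only illustrates how attention FLOPs and score-matrix memory scale as the sequence length grows.

```python
# Minimal sketch (not part of the original card): rough cost of self-attention
# for increasingly long sequences. The model settings below are illustrative
# assumptions, not values from the course.

def attention_cost(seq_len: int, d_model: int = 768, n_layers: int = 12,
                   bytes_per_value: int = 2):
    """Return (approx. FLOPs, bytes of attention scores) for one forward pass."""
    # QK^T and scores @ V each take about 2 * seq_len^2 * d_model FLOPs per layer.
    flops = n_layers * 2 * (2 * seq_len ** 2 * d_model)
    # Each layer materializes a seq_len x seq_len score matrix
    # (the per-head matrices add up to this size).
    score_bytes = n_layers * seq_len ** 2 * bytes_per_value
    return flops, score_bytes

for n in (1_024, 8_192, 65_536):
    flops, mem = attention_cost(n)
    print(f"seq_len={n:>6}: ~{flops / 1e12:7.1f} TFLOPs of attention, "
          f"~{mem / 2**30:6.2f} GiB of score matrices")
```

Going from 8K to 64K tokens multiplies both figures by 64, which is why streaming-length contexts quickly become impractical for a standard Transformer and motivate the alternative memory architectures mentioned above.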
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Key-Value (KV) Cache in Transformer Inference
A language model using a standard Transformer architecture is generating a long sequence of text one token at a time. How does the computational effort required to generate the 500th token compare to the effort required for the 10th token?
Diagnosing Memory Issues in a Language Model
Evaluating Context Handling in Language Models
Explicit Context Encoding via Additional Memory Models