Learn Before
Language Model Scaling Problem
A development team has successfully built a language model using a standard self-attention architecture. The model performs well when processing texts up to 512 tokens in length. However, when they attempt to use the exact same architecture to process legal documents that are 8192 tokens long, they consistently encounter 'out-of-memory' errors, and the processing time for a single document becomes prohibitively long. Based on the computational properties of the model's core mechanism, what is the fundamental reason for this dramatic failure to scale?
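The quadratic cost the question points to can be made concrete with a few lines of code. Below is a minimal sketch, not taken from the course material: attention_scores and score_matrix_gib are illustrative names, and the single-head, fp32 assumptions are ours. It shows why the attention score matrix, of shape (seq_len, seq_len), is the mechanism that fails to scale:

```python
import numpy as np

def attention_scores(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention scores for one head.

    q and k have shape (seq_len, d_head); the result has shape
    (seq_len, seq_len) because every token is compared against every
    other token. This pairwise comparison is the O(n^2) bottleneck.
    """
    return (q @ k.T) / np.sqrt(k.shape[-1])

def score_matrix_gib(seq_len: int, bytes_per_elem: int = 4) -> float:
    """Memory for one fp32 score matrix, in GiB."""
    return seq_len ** 2 * bytes_per_elem / 2 ** 30

for n in (512, 8192):
    print(f"{n:>5} tokens -> {score_matrix_gib(n):.4f} GiB per head, per layer")
```

Going from 512 to 8192 tokens multiplies the sequence length by 16 but the score-matrix memory and compute by 16² = 256, and a real model materializes such a matrix for every head in every layer. That is exactly the out-of-memory failure and prohibitive runtime described in the question.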
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Taxonomy of Efficient Transformers
High-Performance Computing Improvements for Transformers
Language Model Scaling Problem
Developing Efficient Architectures and Training for Long-Sequence Self-Attention
A startup with a limited computational budget is tasked with building a system to analyze and summarize entire books for a digital library. A key requirement is that the model must process the full context of these very long documents simultaneously. Why would a standard transformer architecture be a poor choice for this specific task, and what is the implication for model selection? (A back-of-envelope sketch follows this list.)
Scaling Limitations of Standard Transformers
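The startup scenario above is the same quadratic bottleneck taken to an extreme. Here is a minimal back-of-envelope sketch, assuming an illustrative 150,000-token book and fp32 scores (both figures are our assumptions, not from the card), comparing standard full attention with a sliding-window variant of the kind efficient-transformer architectures use:

```python
def full_attention_gib(n: int, bytes_per_elem: int = 4) -> float:
    """Score-matrix memory for standard self-attention: grows as O(n^2)."""
    return n * n * bytes_per_elem / 2 ** 30

def windowed_attention_gib(n: int, window: int, bytes_per_elem: int = 4) -> float:
    """Score memory if each token attends only to a local window: O(n * w)."""
    return n * window * bytes_per_elem / 2 ** 30

book = 150_000  # assumed token count for one book, illustrative only
print(f"full attention:   {full_attention_gib(book):6.1f} GiB per head, per layer")
print(f"512-token window: {windowed_attention_gib(book, 512):6.3f} GiB per head, per layer")
```

At roughly 84 GiB for a single head in a single layer, full attention over a whole book is beyond any realistic budget, which is why the question points toward selecting an efficient-attention architecture (sparse, local, or linear variants) rather than a standard transformer.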