Problem

Computational Infeasibility of Standard Transformers for Long Sequences

The standard Transformer architecture is fundamentally ill-suited to processing very long sequences because of its computational demands. The core issue is the self-attention mechanism, which compares every token with every other token, so its time and memory costs grow quadratically with sequence length (O(n^2) for n tokens). This quadratic scaling makes it practically infeasible to train or deploy standard Transformers on extremely long inputs.
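The sketch below illustrates where the quadratic term comes from. It is a minimal single-head self-attention in NumPy; the names (X, Wq, Wk, Wv) are illustrative, not from any particular codebase. The n x n score matrix is the bottleneck: its size, and the work to fill it, grow with the square of the sequence length.

```python
import numpy as np

def naive_self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over n token embeddings.

    X: (n, d) input embeddings; Wq, Wk, Wv: (d, d) projection matrices.
    The score matrix S is (n, n), so time and memory grow as O(n^2).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # each (n, d)
    S = Q @ K.T / np.sqrt(K.shape[-1])     # (n, n) pairwise token scores
    S = S - S.max(axis=-1, keepdims=True)  # subtract row max for stability
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)  # row-wise softmax -> attention weights
    return A @ V                           # (n, d) attended outputs

# Doubling n quadruples the score matrix:
# n = 4,096  -> ~16.8M scores per head per layer
# n = 32,768 -> ~1.07B scores (~4 GB in float32) per head per layer
```

Since this cost is paid per head and per layer, and again at every generation step without caching tricks, long contexts quickly exhaust both compute and memory budgets.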
