Concept

Diverse and Combined Data Sources for LLM Pre-training

To achieve strong performance, Large Language Models are typically pre-trained on combined datasets that draw from a wide variety of sources. Beyond large-scale web-scraped data, these corpora often integrate materials such as books, scientific papers, and user-generated content from social media to ensure a diverse training environment.
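The idea of combining heterogeneous sources can be sketched as weighted mixture sampling: each source is assigned a mixture weight, a source is drawn according to those weights, and then a document is drawn from that source. The sketch below is a minimal illustration; the source names, documents, and weights are invented for the example and do not reflect any published pre-training recipe.

```python
import random

# Hypothetical corpus mixture: source name -> (documents, sampling weight).
# The weights are illustrative assumptions, not a real training mix.
corpora = {
    "web":    (["web doc 1", "web doc 2"], 0.6),
    "books":  (["book excerpt 1"],         0.2),
    "papers": (["paper abstract 1"],       0.1),
    "social": (["forum post 1"],           0.1),
}

def sample_batch(corpora, batch_size, seed=0):
    """Draw a training batch by first choosing a source according to its
    mixture weight, then sampling a document uniformly from that source."""
    rng = random.Random(seed)
    names = list(corpora)
    weights = [corpora[name][1] for name in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(corpora[source][0]))
    return batch

batch = sample_batch(corpora, batch_size=8)
```

Drawing the source first (rather than concatenating all documents) lets the mixture proportions be tuned independently of each source's raw size, which is one common motivation for combining corpora this way.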


Updated 2026-05-02


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
