Concept

Common Data Sources for Pre-training LLMs

The pre-training of large language models relies on vast and varied text corpora. Key sources for these datasets include webpages, books, conversational text, software code, Wikipedia, and news articles, in addition to other materials like scientific papers and content from question-and-answer (Q&A) platforms.
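In practice these sources are combined into a weighted data mixture, with documents sampled from each corpus in proportion to a chosen weight. The sketch below illustrates the idea with purely hypothetical weights (real pre-training recipes vary widely and are rarely published in full); the `MIXTURE_WEIGHTS` values and `sample_sources` helper are illustrative assumptions, not any model's actual configuration.

```python
import random

# Hypothetical mixture weights over common pre-training sources.
# These numbers are illustrative only, not a real model's recipe.
MIXTURE_WEIGHTS = {
    "webpages": 0.60,
    "books": 0.12,
    "conversational": 0.08,
    "code": 0.08,
    "wikipedia": 0.05,
    "news": 0.04,
    "qa_and_papers": 0.03,
}

def sample_sources(n, weights=MIXTURE_WEIGHTS, seed=0):
    """Draw n source labels according to the mixture weights."""
    rng = random.Random(seed)
    labels = list(weights)
    probs = [weights[k] for k in labels]
    return rng.choices(labels, weights=probs, k=n)

# Count how often each source appears in a sampled batch.
counts = {}
for src in sample_sources(10_000):
    counts[src] = counts.get(src, 0) + 1
```

With weights like these, webpages dominate the sampled batch while smaller, higher-quality corpora (books, Wikipedia) are still represented; tuning such weights is one of the main levers in curating a pre-training dataset.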

Updated 2026-04-21

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
