Essay

Strategic Use of Synthetic Data in LLM Pre-training

A research lab is developing a new large language model and has a massive corpus of human-generated text. The lab is considering two strategies to augment this corpus with synthetically generated data for pre-training:

  1. Domain Expansion: Generating text on specialized, low-resource topics (e.g., advanced theoretical physics, ancient legal codes) that are underrepresented in their original corpus.
  2. Reasoning Augmentation: Generating complex, multi-step reasoning problems and their detailed solutions (e.g., mathematical proofs, logical puzzles).

Analyze the potential benefits and primary risks associated with each of these two strategies for the foundational capabilities of the resulting model.

Updated 2025-10-03

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science