Case Study

LLM Training Data Strategy Evaluation

A research lab is developing a new large language model for general-purpose tasks, including factual question answering, text summarization, and creative writing. Within its budget constraints, the lab has two primary options for sourcing pre-training data:

  • Option A: License a single, massive 10-terabyte dataset consisting exclusively of filtered web page content.
  • Option B: Curate a smaller, 5-terabyte dataset by combining web page content with digitized books and a collection of scientific research papers.

Which data sourcing option would you recommend for achieving the strongest overall performance in the resulting model? Justify your recommendation by evaluating the trade-offs between the two options.


Updated 2025-10-05


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science