1Cademy - LLM Training Data Strategy Evaluation

Learn Before

Impact of Combined Datasets on LLM Performance

Case Study

LLM Training Data Strategy Evaluation

A research lab is developing a new large language model intended for general-purpose tasks, including factual question-answering, text summarization, and creative writing. Given its budget constraints, the lab has two primary options for sourcing its pre-training data:

Option A: License a single, massive 10-terabyte dataset consisting exclusively of filtered web page content.
Option B: Curate a smaller, 5-terabyte dataset by combining web page content with digitized books and a collection of scientific research papers.

Which data sourcing option would you recommend for achieving the strongest overall performance in the resulting model? Justify your recommendation by evaluating the trade-offs between the two options.

Updated 2025-10-05
