Learn Before
LLM Training Data Strategy Evaluation
A research lab is developing a new large language model intended for general-purpose tasks, including factual question-answering, text summarization, and creative writing. Given their budget constraints, they have two primary options for sourcing their pre-training data:
- Option A: License a single, massive 10-terabyte dataset consisting exclusively of filtered web page content.
- Option B: Curate a smaller, 5-terabyte dataset by combining web page content with digitized books and a collection of scientific research papers.
Which data sourcing option would you recommend for achieving the strongest overall performance in the resulting model? Justify your recommendation by evaluating the trade-offs between the two options.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A development team is comparing two large language models. Model 'Helios' was trained exclusively on a massive dataset of text and code scraped from the public internet. Model 'Selene' was trained on a carefully curated dataset that combines a similar internet scrape with a vast library of digitized books and peer-reviewed academic journals. Based on their training data, which statement provides the most accurate analysis of their likely capabilities?
LLM Training Data Strategy Evaluation
Rationale for Diverse LLM Training Data