Learn Before
Rationale for Diverse LLM Training Data
A research team is building a new large language model. One member argues for training it solely on a massive dataset of web pages, since that is the largest and most readily available source. Another member argues for supplementing the web data with smaller but carefully curated datasets of books and academic papers, even though this will be more costly and time-consuming. Explain why the second team member's approach is likely to result in a more capable and robust model.
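One common way teams act on the second member's argument is to train on a weighted mixture of sources rather than sampling in proportion to raw corpus size, so that small curated corpora (books, papers) are not drowned out by the much larger web scrape. The sketch below illustrates that idea; the source names, example documents, and mixture weights are illustrative assumptions, not details from the scenario above.

```python
import random

# Hypothetical corpora: a large web scrape plus smaller curated sources.
# Contents and weights here are illustrative assumptions only.
CORPORA = {
    "web":    ["web doc 1", "web doc 2", "web doc 3", "web doc 4"],
    "books":  ["book excerpt 1", "book excerpt 2"],
    "papers": ["paper abstract 1", "paper abstract 2"],
}

# Mixture weights decouple sampling probability from corpus size,
# upweighting the small curated sources relative to the web scrape.
MIXTURE_WEIGHTS = {"web": 0.6, "books": 0.25, "papers": 0.15}

def sample_batch(batch_size, rng=random):
    """Draw a training batch where each example's source is chosen
    by mixture weight, then a document is drawn from that source."""
    sources = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[s] for s in sources]
    batch = []
    for _ in range(batch_size):
        src = rng.choices(sources, weights=weights, k=1)[0]
        batch.append((src, rng.choice(CORPORA[src])))
    return batch

batch = sample_batch(10)
```

With size-proportional sampling, the web scrape would dominate nearly every batch; fixed mixture weights guarantee the curated sources a steady share of training signal, which is one mechanism behind the capability gains the question asks about.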
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A development team is comparing two large language models. Model 'Helios' was trained exclusively on a massive dataset of text and code scraped from the public internet. Model 'Selene' was trained on a carefully curated dataset that combines a similar internet scrape with a vast library of digitized books and peer-reviewed academic journals. Based on their training data, which statement provides the most accurate analysis of their likely capabilities?
LLM Training Data Strategy Evaluation
Rationale for Diverse LLM Training Data