Short Answer

Rationale for Diverse LLM Training Data

A research team is building a new large language model. One member argues for training it solely on a massive dataset of web pages, since it is the largest and most readily available source. Another member argues for supplementing the web data with smaller, curated datasets of books and academic papers, even though doing so is more costly and time-consuming. Explain why the second team member's approach is likely to produce a more capable and robust model.

Updated 2025-10-10

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course