Essay

Evaluating Data Sources for LLM Pre-training

A technology company is developing a new large language model intended for public use. A significant portion of its pre-training data is sourced from a massive crawl of the public internet, including forums and social media. Critically evaluate the decision to rely heavily on this type of data. In your response, you must weigh the potential benefits for the model's general capabilities against the significant ethical and performance-related risks.

Updated 2025-09-29

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science