Essay

Critique of Unfiltered Web Data for LLM Training

A research team proposes training a new large language model exclusively on a massive, unfiltered dataset scraped from the public internet. Their primary argument is that the sheer volume of data will naturally overcome any isolated quality issues. Evaluate this training strategy. In your evaluation, identify at least two distinct types of problems inherent in such data and explain the potential negative consequences each could have on the resulting model's performance and reliability.

Updated 2025-10-02

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science