Concept

Risks of Using Unfiltered Web Data for LLM Training

Training LLMs directly on raw, unfiltered text from the internet is risky. Web-scraped data, a major source for training corpora, often contains factual errors, inappropriate or toxic content, and fabricated information. The growing volume of AI-generated text online makes it even harder to ensure data integrity.

Updated 2026-04-21

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences