Concept

Challenges of Using Web-Scraped Data for LLM Training

A significant challenge in LLM training stems from the reliance on web-scraped data, which constitutes a large portion of training corpora. Much of this data is of poor quality: it contains factual errors, toxic or inappropriate content, and fabricated information, so it is unsuitable for direct use.
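As a rough illustration of why scraped text is usually screened before use, the sketch below applies a few simple heuristics to raw documents. It is a minimal example under stated assumptions, not a description of any particular pipeline: the names `looks_usable`, `filter_corpus`, and `BLOCKLIST`, and the thresholds, are hypothetical. Heuristics like these only catch surface-level problems such as very short or symbol-heavy pages and listed terms. They cannot detect factual errors or fabricated information.

```python
import re

# Hypothetical blocklist for illustration only; a real pipeline would rely on
# a curated lexicon or a trained classifier rather than a hand-written set.
BLOCKLIST = {"placeholderslur"}


def looks_usable(text: str, min_words: int = 20, max_symbol_ratio: float = 0.3) -> bool:
    """Return True if the text passes a few crude surface-level quality checks."""
    words = re.findall(r"[A-Za-z']+", text.lower())

    # Very short pages (navigation stubs, "click here" fragments) carry little signal.
    if len(words) < min_words:
        return False

    # Reject documents containing any blocklisted term.
    if any(word in BLOCKLIST for word in words):
        return False

    # Reject documents dominated by punctuation or markup debris.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False

    return True


def filter_corpus(docs):
    """Keep only the documents that pass the heuristic checks."""
    return [doc for doc in docs if looks_usable(doc)]
```

For example, `filter_corpus` would drop a two-word page such as "click here" while keeping an ordinary paragraph of prose. The thresholds are arbitrary starting points and would need tuning on real data.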

Updated 2026-04-21

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences