Learn Before
Risks of Using Unfiltered Web Data for LLM Training
Training LLMs directly on raw, unfiltered text from the internet is problematic. Web-scraped data, a major source of training corpora, often contains factual errors, inappropriate or toxic content, and fabricated information. The growing volume of AI-generated text online makes it even harder to ensure data integrity.
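As a rough illustration of what filtering such data can involve, the following minimal sketch applies three simple heuristics to a list of documents: a minimum length, a term blocklist, and exact-duplicate removal. The names `filter_corpus`, `normalize`, `BLOCKLIST`, and `MIN_WORDS` are hypothetical, and the blocklist entries and threshold are placeholders chosen only for the example. Heuristics like these catch only the most obvious low-quality text; they do not detect factual errors or fabricated information.

```python
import hashlib
import re

# Hypothetical placeholder terms; a real pipeline would more likely rely on
# a curated list or a trained classifier.
BLOCKLIST = {"badword1", "badword2"}
MIN_WORDS = 5  # assumed threshold, chosen only for illustration


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_corpus(documents):
    """Keep documents that pass length, blocklist, and exact-duplicate checks."""
    seen = set()
    kept = []
    for doc in documents:
        norm = normalize(doc)
        words = norm.split()
        # Drop very short fragments such as navigation text.
        if len(words) < MIN_WORDS:
            continue
        # Drop documents containing any blocklisted term (punctuation stripped).
        if any(w.strip(".,!?;:\"'") in BLOCKLIST for w in words):
            continue
        # Drop exact duplicates, compared after normalization.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Given a short fragment, a near-identical copy that differs only in case and spacing, and a document containing a blocklisted term, only the first clean document would be kept. The order of the checks is a design choice: the cheap length test runs first so that hashing is done only for documents that survive it.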
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Risks of Using Unfiltered Web Data for LLM Training
Data Filtering and Cleaning in the LLM Training Workflow
A machine learning team is developing a new large-scale text-generating model. They must choose between two potential training datasets. Dataset A contains 5 terabytes of raw, unfiltered text scraped from a wide variety of public websites. Dataset B contains 1 terabyte of text that has been carefully curated, cleaned for errors, and filtered to remove undesirable content. Given that the primary goal is to create a reliable and high-performing model, which of the following is the most justifiable decision?
Challenges of Using Web-Scraped Data for LLM Training
Harm of Training LLMs on Unfiltered Data
Data Filtering and Cleaning to Improve Quality
Analyzing Chatbot Performance Issues
Consequences of Unfiltered Training Data
Learn After
LLM Training Data Strategy
Critique of Unfiltered Data Training Strategy
A development team decides to train a new large language model using a vast, unfiltered corpus of text scraped directly from the public internet. Which of the following is the most significant and direct risk associated with this data collection strategy?