Learn Before
Challenges of Using Web-Scraped Data for LLM Training
A significant challenge in LLM training stems from the reliance on web-scraped data, which constitutes a large portion of training corpora. This data is often of poor quality: it can contain factual errors, toxic or inappropriate content, and fabricated information, which makes it unsuitable for direct use.
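To make "unsuitable for direct use" concrete, here is a minimal sketch of a heuristic pre-filter over scraped documents. The function names, thresholds, and the tiny blocklist are hypothetical, chosen only for illustration. Surface heuristics like these can flag some noisy or toxic text, but they cannot detect factual errors or fabricated information, which is part of why the problem is hard.

```python
import re

# Hypothetical blocklist for illustration only; a real pipeline would
# more likely rely on a curated lexicon or a trained classifier.
BLOCKLIST = {"idiot", "stupid"}


def looks_usable(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Return True if a scraped document passes three simple surface checks."""
    words = re.findall(r"[A-Za-z']+", text.lower())

    # Very short fragments (menus, buttons, spam stubs) carry little signal.
    if len(words) < min_words:
        return False

    # Crude toxicity check: reject if any blocklisted word appears.
    if any(word in BLOCKLIST for word in words):
        return False

    # Reject text dominated by punctuation or markup-like symbols.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(text), 1) <= max_symbol_ratio


def filter_corpus(docs):
    """Keep only the documents that pass the heuristic checks."""
    return [doc for doc in docs if looks_usable(doc)]


scraped = [
    "The capital of France is Paris, a city on the Seine.",
    "click here!!! $$$ >>> ###",
    "You are a stupid idiot and everyone knows it well.",
]
kept = filter_corpus(scraped)
```

In this sketch only the first document is kept. A fluent but false sentence would pass every check, so filters of this kind address only part of the quality problem.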
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Risks of Using Unfiltered Web Data for LLM Training
Data Filtering and Cleaning in the LLM Training Workflow
A machine learning team is developing a new large-scale text-generating model. They must choose between two potential training datasets. Dataset A contains 5 terabytes of raw, unfiltered text scraped from a wide variety of public websites. Dataset B contains 1 terabyte of text that has been carefully curated, cleaned for errors, and filtered to remove undesirable content. Given that the primary goal is to create a reliable and high-performing model, which of the following is the most justifiable decision?
Challenges of Using Web-Scraped Data for LLM Training
Harm of Training LLMs on Unfiltered Data
Data Filtering and Cleaning to Improve Quality
Analyzing Chatbot Performance Issues
Consequences of Unfiltered Training Data
Learn After
Impact of AI-Generated Content on Data Collection
Evaluating Web-Scraped Text for Training Data
Critique of Unfiltered Web Data for LLM Training
An AI development team is creating a training dataset for a new LLM intended for use in educational settings. They have a large corpus of data scraped from various online forums and blogs. Which of the following data quality issues presents the most critical and immediate challenge to the model's suitability for its intended purpose?
An LLM development team is analyzing a large dataset scraped from the internet. Match each type of data quality issue they might encounter with its most accurate description and impact on the model.