Learn Before
Harm of Training LLMs on Unfiltered Data
Research shows that training large language models on unfiltered data is harmful: low-quality content in the training set, such as factual errors and toxic material, degrades the resulting model's performance and reliability.
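As a rough illustration of what "filtering" can mean in practice, the sketch below applies a few simple quality heuristics to raw documents before they would enter a training corpus. The specific thresholds, the blocklist phrases, and the helper name is_high_quality are illustrative assumptions, not rules taken from the course material.

```python
# Minimal sketch of heuristic data filtering for web-scraped text.
# All thresholds and rules below are illustrative assumptions.

def is_high_quality(doc: str) -> bool:
    """Return True if a document passes simple quality heuristics."""
    words = doc.split()
    if len(words) < 20:  # too short to be useful training text
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_ratio < 0.6:  # mostly symbols/markup suggests noise or boilerplate
        return False
    # Stand-in for a real toxicity / spam classifier.
    blocklist = {"lorem ipsum", "click here to subscribe"}
    lowered = doc.lower()
    if any(phrase in lowered for phrase in blocklist):
        return False
    return True


raw_corpus = [
    "A well-formed paragraph of natural language text that a model could learn from. " * 3,
    "$$$ 1234 @@@ ###",  # noisy, low-quality fragment that should be dropped
]
cleaned_corpus = [doc for doc in raw_corpus if is_high_quality(doc)]
print(f"Kept {len(cleaned_corpus)} of {len(raw_corpus)} documents")
```

Real pipelines typically layer many such filters (deduplication, language identification, toxicity classifiers), but the principle is the same: discard documents that would teach the model errors or undesirable behavior.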
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Risks of Using Unfiltered Web Data for LLM Training
Data Filtering and Cleaning in the LLM Training Workflow
A machine learning team is developing a new large-scale text-generating model. They must choose between two potential training datasets. Dataset A contains 5 terabytes of raw, unfiltered text scraped from a wide variety of public websites. Dataset B contains 1 terabyte of text that has been carefully curated, cleaned for errors, and filtered to remove undesirable content. Given that the primary goal is to create a reliable and high-performing model, which of the following is the most justifiable decision?
Challenges of Using Web-Scraped Data for LLM Training
Harm of Training LLMs on Unfiltered Data
Data Filtering and Cleaning to Improve Quality
Analyzing Chatbot Performance Issues
Consequences of Unfiltered Training Data
Learn After
LLM Training Data Strategy Evaluation
A development team is building a new large language model and decides to use a massive dataset scraped directly from the public internet. They apply no filtering, cleaning, or quality control measures to this data. Based on established research, which of the following is the most likely and direct consequence of this approach?
Analyzing the Impact of Unfiltered Training Data