Learn Before
Analyzing the Impact of Unfiltered Training Data
A machine learning team is training a new large language model. One engineer suggests using a massive, unfiltered dataset from the internet to maximize the amount of training data. Identify two distinct types of low-quality content likely to be present in this unfiltered dataset, and describe the specific negative impact each type could have on the final model's behavior.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
LLM Training Data Strategy Evaluation
A development team is building a new large language model and decides to use a massive dataset scraped directly from the public internet. They apply no filtering, cleaning, or quality control measures to this data. Based on established research, which of the following is the most likely and direct consequence of this approach?