Critique of Unfiltered Web Data for LLM Training
A research team proposes training a new large language model exclusively on a massive, unfiltered dataset scraped from the public internet. Their primary argument is that the sheer volume of data will naturally overcome any isolated quality issues. Evaluate this training strategy. In your evaluation, identify at least two distinct types of problems inherent in such data and explain the potential negative consequences each could have on the resulting model's performance and reliability.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Impact of AI-Generated Content on Data Collection
Evaluating Web-Scraped Text for Training Data
An AI development team is creating a training dataset for a new LLM intended for use in educational settings. They have a large corpus of data scraped from various online forums and blogs. Which of the following data quality issues presents the most critical and immediate challenge to the model's suitability for its intended purpose?
An LLM development team is analyzing a large dataset scraped from the internet. Match each type of data quality issue they might encounter with its most accurate description and impact on the model.