Based on the plan described in the case study, evaluate the startup's data strategy. Identify the most significant potential flaw in their reasoning and explain at least two specific, negative outcomes that could result from training their chatbot on this type of unfiltered data.

Google

Training LLMs directly on raw, unfiltered internet text is problematic. Web-scraped data, a major source of training corpora, often contains factual errors, inappropriate or toxic content, and fabricated information, all of which a model can absorb and reproduce. The growing volume of AI-generated text online makes data integrity even harder to ensure.
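To make the contrast with "unfiltered" concrete, here is a minimal sketch of what a basic filtering pass might look like. It is an illustration under stated assumptions, not a production pipeline: the function names, the `BLOCKLIST` terms, and the `min_words` threshold are all hypothetical. It only handles very short fragments, blocklisted terms, and exact duplicates; it does not detect factual errors, fabricated information, or AI-generated text, which need far heavier tooling.

```python
import hashlib
import re

# Hypothetical blocklist for illustration only; a real pipeline would more
# likely rely on a trained toxicity classifier or a curated lexicon.
BLOCKLIST = {"badword1", "badword2"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def filter_corpus(docs, min_words=5):
    """Yield documents that pass three simple heuristic checks."""
    seen = set()
    for doc in docs:
        norm = normalize(doc)
        words = re.findall(r"[a-z0-9']+", norm)
        if len(words) < min_words:
            continue  # too short to carry useful signal
        if BLOCKLIST.intersection(words):
            continue  # contains a blocklisted term
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield doc
```

For example, passing a list of scraped strings through `list(filter_corpus(docs))` keeps the first copy of each sufficiently long, non-blocklisted document and drops the rest. Even this crude pass shows the trade-off in the startup's argument: skipping filtering entirely does not make the data "unbiased", it just leaves the noise in.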

Risks of Using Unfiltered Web Data for LLM Training

LLM Training Data Strategy

A technology startup argues that training its new large language model on a massive, completely unfiltered dataset scraped from the internet will give it the "most comprehensive and unbiased view of humanity." Evaluate this argument. In your response, identify at least three distinct types of problematic content found in such data and explain the potential negative consequences of each for the model's final behavior and utility.

Critique of Unfiltered Data Training Strategy

A development team decides to train a new large language model using a vast, unfiltered corpus of text scraped directly from the public internet. Which of the following is the most significant and direct risk associated with this data collection strategy?
