Evaluating a Data Cleaning Strategy for LLM Training
A team is preparing a text corpus from a large web crawl to train a general-purpose, helpful chatbot. Their data cleaning plan consists of the following steps:
- Discard any document with fewer than 50 words.
- Remove any document that contains a word from a predefined list of 100 common English profanities.
- Keep only documents identified as being in the English language.
Critically evaluate this data cleaning strategy. Identify one significant weakness in this plan and explain what negative consequences it could have on the final trained model's performance.
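For concreteness, the three filters in the plan can be sketched as a single keep/drop predicate. This is a minimal illustration under stated assumptions, not the team's actual pipeline: `BLOCKLIST` is a two-word placeholder for the 100-word profanity list, "contains a word" is read as a whole-word match, and `looks_english` is a crude hypothetical stand-in (share of ASCII letters) for a real language-identification model.

```python
import re

# Placeholder for the plan's list of 100 profanities (assumption: only two
# mild entries are shown here for illustration).
BLOCKLIST = {"damn", "hell"}

# Minimum length from the plan: documents with fewer words are discarded.
MIN_WORDS = 50


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def looks_english(text, threshold=0.9):
    """Hypothetical stand-in for a language identifier.

    Treats a document as English when at least `threshold` of its
    alphabetic characters are ASCII. A real pipeline would use a trained
    language-identification model instead.
    """
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(c.isascii() for c in letters)
    return ascii_letters / len(letters) >= threshold


def keep_document(text):
    """Apply the plan's three filters; return True if the document survives."""
    # Step 1: length filter.
    if len(text.split()) < MIN_WORDS:
        return False
    # Step 2: drop the whole document on any whole-word blocklist match.
    if any(token in BLOCKLIST for token in tokenize(text)):
        return False
    # Step 3: language filter.
    return looks_english(text)


def clean_corpus(docs):
    """Return only the documents that pass all three filters."""
    return [doc for doc in docs if keep_document(doc)]
```

Writing the rules out this way makes each one easy to probe: feeding individual documents to `keep_document` shows exactly which kinds of text the plan keeps and which it throws away.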