1Cademy - Evaluating a Data Cleaning Strategy for LLM Training

Case Study

Evaluating a Data Cleaning Strategy for LLM Training

A team is preparing a text corpus from a large web crawl to train a general-purpose, helpful chatbot. Their data cleaning plan consists of the following steps:

1. Discard any document with fewer than 50 words.
2. Remove any document that contains a word from a predefined list of 100 common English profanities.
3. Keep only documents identified as being in the English language.

Critically evaluate this data cleaning strategy. Identify one significant weakness in this plan and explain what negative consequences it could have on the final trained model's performance.

Updated 2025-10-05
