Case Study

Evaluating a Data Cleaning Strategy for LLM Training

A team is preparing a text corpus from a large web crawl to train a general-purpose, helpful chatbot. Their data cleaning plan consists of the following steps:

  1. Discard any document with fewer than 50 words.
  2. Remove any document that contains a word from a predefined list of 100 common English profanities.
  3. Keep only documents identified as being in the English language.
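
To make the plan concrete for evaluation, the three steps can be sketched as a single filter. This is a minimal illustration, not the team's actual implementation: `BLOCKLIST` is a two-word placeholder for the 100-word profanity list, and `looks_english` is a crude stopword heuristic standing in for whatever language identifier the team would really use. All names and thresholds other than the 50-word minimum are assumptions.

```python
import re
from typing import Iterable, List

# Hypothetical placeholder for the team's list of 100 profanities.
BLOCKLIST = {"damn", "hell"}

# Assumed stand-in for a real language identifier: common English stopwords.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "that", "it", "for"}

WORD_RE = re.compile(r"[A-Za-z']+")


def tokenize(text: str) -> List[str]:
    """Lowercase alphabetic tokens, used for blocklist and language checks."""
    return [w.lower() for w in WORD_RE.findall(text)]


def looks_english(words: List[str], min_ratio: float = 0.1) -> bool:
    """Crude heuristic: enough of the tokens are common English stopwords."""
    if not words:
        return False
    hits = sum(1 for w in words if w in STOPWORDS)
    return hits / len(words) >= min_ratio


def keep_document(text: str, min_words: int = 50) -> bool:
    """Apply the three steps of the plan in order."""
    # Step 1: discard documents with fewer than 50 whitespace-separated words.
    if len(text.split()) < min_words:
        return False
    words = tokenize(text)
    # Step 2: discard documents containing any blocklisted word (whole-word match).
    if any(w in BLOCKLIST for w in words):
        return False
    # Step 3: keep only documents that look like English.
    return looks_english(words)


def clean_corpus(docs: Iterable[str]) -> List[str]:
    """Return only the documents that survive all three filters."""
    return [d for d in docs if keep_document(d)]
```

Note that every step here is a hard, document-level rule: a single blocklisted word removes the whole document regardless of context, and nothing in the pipeline looks at duplication or content quality. The sketch matches whole words, so "hello" does not trigger on "hell"; a substring-based match would behave differently.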

Critically evaluate this data cleaning strategy. Identify one significant weakness in this plan and explain what negative consequences it could have on the final trained model's performance.

Updated 2025-10-05

Tags

Ch.2 Generative Models - Foundations of Large Language Models
