Learn Before
  • Data Quality as a Key Issue in LLM Training

Data Filtering and Cleaning in the LLM Training Workflow

To address poor data quality, the standard workflow for preparing LLM training data includes filtering and cleaning steps. These steps correct errors and remove undesirable content, improving the overall quality and reliability of the text corpus used to train the model.
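As a rough illustration of what such filtering and cleaning steps might look like, here is a minimal Python sketch. It is not a description of any particular production pipeline: the function names (`clean`, `keep`, `prepare_corpus`), the `MIN_WORDS` threshold, and the `BLOCKLIST` are all hypothetical choices made for this example, and exact-duplicate removal is included only as one commonly used cleaning step.

```python
import re
import unicodedata

# Hypothetical threshold and blocklist, chosen only for illustration.
MIN_WORDS = 5
BLOCKLIST = {"lorem", "ipsum"}

def clean(text: str) -> str:
    """Normalize Unicode, drop HTML-like tags, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def keep(text: str) -> bool:
    """Heuristic filter: long enough and free of blocklisted words."""
    words = text.lower().split()
    if len(words) < MIN_WORDS:
        return False
    return not any(w.strip(".,!?") in BLOCKLIST for w in words)

def prepare_corpus(docs):
    """Clean each document, filter it, and drop exact duplicates."""
    seen = set()
    result = []
    for doc in docs:
        cleaned = clean(doc)
        if not keep(cleaned):
            continue
        key = cleaned.lower()  # case-insensitive exact-match key
        if key in seen:
            continue
        seen.add(key)
        result.append(cleaned)
    return result

raw_docs = [
    "<p>Large language models   learn from text corpora.</p>",
    "Large language models learn from text corpora.",
    "Click here!",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
]
corpus = prepare_corpus(raw_docs)
print(corpus)
```

In this toy run only one document survives: the second is an exact duplicate of the first after cleaning, the third is too short, and the fourth contains blocklisted filler words. Real pipelines typically rely on far more sophisticated quality and safety filters than these heuristics.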

6 months ago

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related
  • Risks of Using Unfiltered Web Data for LLM Training

  • A machine learning team is developing a new large-scale text-generating model. They must choose between two potential training datasets. Dataset A contains 5 terabytes of raw, unfiltered text scraped from a wide variety of public websites. Dataset B contains 1 terabyte of text that has been carefully curated, cleaned for errors, and filtered to remove undesirable content. Given that the primary goal is to create a reliable and high-performing model, which of the following is the most justifiable decision?

  • Challenges of Using Web-Scraped Data for LLM Training

  • Harm of Training LLMs on Unfiltered Data

  • Data Filtering and Cleaning to Improve Quality

  • Analyzing Chatbot Performance Issues

  • Consequences of Unfiltered Training Data

Learn After
  • A development team trains a large language model on a vast, unprocessed collection of text from the internet. During testing, they find the model frequently produces outputs that are nonsensical, contain harmful stereotypes, and include fabricated information. Which of the following strategies should the team prioritize to most effectively address the root cause of these issues before their next training attempt?

  • Evaluating a Data Cleaning Strategy for LLM Training

  • A machine learning team is preparing a large text dataset to train a new language model. Arrange the following data processing steps into a logical and effective sequence.