Learn Before
Data Filtering and Cleaning to Improve Quality
To address the problem of poor data quality, a common practice is to integrate filtering and cleaning steps into the data processing workflow. These procedures are designed to refine the raw text by removing errors, inappropriate content, and other undesirable elements before the data is used for model training.
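As a rough illustration of how such steps might sit in a processing workflow, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than a prescribed method: the function names `clean`, `keep`, and `process`, the `MIN_WORDS` threshold, the one-word placeholder `BLOCKLIST`, and the choice to treat exact duplicates as undesirable elements.

```python
import re
import unicodedata

# Hypothetical thresholds and rules, chosen only for illustration.
MIN_WORDS = 5
BLOCKLIST = {"badword"}  # placeholder for an inappropriate-content list

TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def clean(text: str) -> str:
    """Cleaning: repair the text itself (markup residue, odd whitespace)."""
    text = unicodedata.normalize("NFKC", text)
    text = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def keep(text: str) -> bool:
    """Filtering: decide whether a cleaned document stays in the corpus."""
    words = text.lower().split()
    if len(words) < MIN_WORDS:
        return False
    return not any(w.strip(".,!?;:") in BLOCKLIST for w in words)


def process(raw_docs):
    """Clean each raw document, then drop failures and exact duplicates."""
    seen = set()
    result = []
    for raw in raw_docs:
        doc = clean(raw)
        if not keep(doc) or doc in seen:
            continue
        seen.add(doc)
        result.append(doc)
    return result


raw = [
    "<p>The model is trained on   a large text corpus.</p>",
    "The model is trained on a large text corpus.",
    "click here",
    "This sentence contains a badword and should be dropped.",
]
print(process(raw))
# → ['The model is trained on a large text corpus.']
```

In this sketch, cleaning runs before filtering so that the filter and the duplicate check see normalized text. A real pipeline would likely use far more elaborate rules at each stage.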
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Risks of Using Unfiltered Web Data for LLM Training
Data Filtering and Cleaning in the LLM Training Workflow
A machine learning team is developing a new large-scale text-generating model. They must choose between two potential training datasets. Dataset A contains 5 terabytes of raw, unfiltered text scraped from a wide variety of public websites. Dataset B contains 1 terabyte of text that has been carefully curated, cleaned for errors, and filtered to remove undesirable content. Given that the primary goal is to create a reliable and high-performing model, which of the following is the most justifiable decision?
Challenges of Using Web-Scraped Data for LLM Training
Harm of Training LLMs on Unfiltered Data
Data Filtering and Cleaning to Improve Quality
Analyzing Chatbot Performance Issues
Consequences of Unfiltered Training Data
Learn After
A data scientist is preparing a large text corpus scraped from public internet forums to train a general-purpose chatbot. To improve data quality, they apply a filter that automatically deletes any text segment containing words from a predefined list of profanities. Which statement provides the most accurate evaluation of this data cleaning strategy?
Refining a Customer Service Chatbot Dataset
You are tasked with creating a data processing pipeline to clean a large, raw text corpus for training a language model. Arrange the following cleaning steps into the most logical and efficient order.
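For readers working through the forum-corpus question above, the kind of blocklist filter it describes could be sketched as follows. This is only an illustrative sketch: the names `BLOCKLIST`, `contains_blocked`, and `filter_segments` are hypothetical, mild placeholder words stand in for a real profanity list, and whole-word matching is an assumption the question does not specify.

```python
import re

# Hypothetical blocklist; mild placeholder words stand in for real profanities.
BLOCKLIST = {"darn", "heck"}
WORD_RE = re.compile(r"[a-z']+")


def contains_blocked(segment: str) -> bool:
    """True if any whole word in the segment is on the blocklist."""
    return any(w in BLOCKLIST for w in WORD_RE.findall(segment.lower()))


def filter_segments(segments):
    """Delete every segment that contains a blocklisted word."""
    return [s for s in segments if not contains_blocked(s)]


segments = [
    "Thanks, that fixed my install problem.",
    "Well darn, the update broke it again.",
    "Moderators remove posts that say heck.",
]
print(filter_segments(segments))
# → ['Thanks, that fixed my install problem.']
```

The third segment only mentions a listed word, yet it is deleted along with the second. A filter of this shape inspects which words are present and has no view of the context in which they appear.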