Data Quality as a Key Issue in LLM Training
The quality of training data is a fundamental issue in the development of data-driven NLP systems, and it is especially critical for Large Language Models. Using raw text directly from various sources is generally undesirable, as research has shown that training on unfiltered data can be harmful to the model's performance and reliability.
0
1
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Data Quality as a Key Issue in LLM Training
Data Diversity as a Key Issue in LLM Training
Data Bias as a Key Issue in LLM Training
Privacy Concerns in LLM Data Collection
Architectural Modifications for Trainable LLMs
Model Modification for Large-Scale Training
Distributed Training for LLMs
Evaluating a Large-Scale Model Training Plan
A team is developing a new large-scale language model and encounters several distinct challenges. Match each challenge with the primary technical area that needs to be addressed to solve it.
Prioritizing Challenges in Large-Scale Model Training
Data Preparation for Large-Scale LLM Training
Data Quality as a Key Issue in LLM Training
Analyzing a Data Preparation Pipeline
A team is preparing a massive text dataset for training a new large language model. Arrange the following key data preparation stages into the most logical and efficient sequence.
A research team is preparing a massive, diverse dataset scraped from the web to train a large language model. They are primarily concerned with two potential issues: training instability and the model learning undesirable social biases from the raw data. Which of the following data preparation strategies would most directly and effectively address both of these concerns?
Learn After
Risks of Using Unfiltered Web Data for LLM Training
Data Filtering and Cleaning in the LLM Training Workflow
A machine learning team is developing a new large-scale text-generating model. They must choose between two potential training datasets. Dataset A contains 5 terabytes of raw, unfiltered text scraped from a wide variety of public websites. Dataset B contains 1 terabyte of text that has been carefully curated, cleaned for errors, and filtered to remove undesirable content. Given that the primary goal is to create a reliable and high-performing model, which of the following is the most justifiable decision?
Challenges of Using Web-Scraped Data for LLM Training
Harm of Training LLMs on Unfiltered Data
Data Filtering and Cleaning to Improve Quality
Analyzing Chatbot Performance Issues
Consequences of Unfiltered Training Data