Data Selection and Filtering Using Weak Models
One method for curating training data for large language models leverages a smaller, weaker model: the weak model computes a metric such as likelihood or cross-entropy for each data sequence, and these scores then serve as selection criteria, filtering out less suitable data or prioritizing high-quality data for pre-training or fine-tuning.
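As a concrete illustration, the sketch below scores each document by its mean per-token cross-entropy under a small pretrained model and keeps only documents below a threshold. This is a minimal sketch assuming a Hugging Face causal language model; the model choice (gpt2), the threshold value, and the toy corpus are illustrative assumptions, not settings prescribed by the method.

```python
# Minimal sketch: likelihood-based data filtering with a small "weak" model.
# Assumptions: Hugging Face transformers, gpt2 as the weak model, and an
# arbitrary cross-entropy cutoff; none of these are prescribed values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # small weak model (illustrative choice)
MAX_LOSS = 4.0        # mean per-token cross-entropy cutoff, in nats (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def sequence_cross_entropy(text: str) -> float:
    """Mean per-token cross-entropy of `text` under the weak model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    # Passing labels=input_ids makes the model return the average
    # next-token prediction loss over the sequence.
    return model(ids, labels=ids).loss.item()

def filter_corpus(docs: list[str]) -> list[str]:
    """Keep documents the weak model finds sufficiently predictable."""
    return [d for d in docs if sequence_cross_entropy(d) <= MAX_LOSS]

corpus = ["The cat sat on the mat.", "xjq zzv 9#@!! qwop vrk"]
print(filter_corpus(corpus))  # the gibberish line typically scores far higher loss
```

The same scores can also be used in the opposite direction, up-weighting or prioritizing the highest-likelihood sequences rather than hard-filtering the lowest.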
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Example of Successful Weak-to-Strong Generalization: GPT-4 with GPT-2 Supervision
Weak Performance (P_weak) as a Baseline Metric
Weak-to-Strong Performance (P_weak→strong)
Strong Ceiling Performance (P_ceiling)
Performance Gap Recovered (PGR)
Cascading Inference
Weak-to-Strong Generalization via Fine-Tuning on Weak Model Data
AI System Optimization Strategy
An AI development team is building a system to answer a very high volume of customer support queries. They implement a two-step process: first, a small, fast model attempts to answer each query. If this model's confidence in its answer is low, the query is then passed to a much larger, more powerful, but slower model. What is the most significant strategic advantage of this architectural choice?
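The cascade in this scenario can be sketched as follows. The Answer type, the confidence measure, and the 0.8 threshold are illustrative assumptions; the two models are passed in as plain callables rather than any specific API.

```python
# Minimal sketch of cascading inference: a cheap model answers first, and
# the query escalates to an expensive model only on low confidence.
# The confidence measure and threshold are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Answer:
    text: str
    confidence: float  # e.g., mean token probability, in [0, 1]

CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff

def cascade(query: str,
            small_model: Callable[[str], Answer],
            large_model: Callable[[str], Answer]) -> Answer:
    first = small_model(query)
    if first.confidence >= CONFIDENCE_THRESHOLD:
        return first           # most queries stop here, keeping latency and cost low
    return large_model(query)  # only hard cases pay for the slow, powerful model
```

Because only low-confidence queries ever reach the large model, average cost and latency track the small model, while the fallback preserves answer quality on the hard cases.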
Direct Supervision via Knowledge Distillation Loss in Weak-to-Strong Generalization
When a large, powerful computational model is trained using labels generated exclusively by a smaller, less accurate model, the large model's performance on new, unseen data is fundamentally limited: it cannot exceed the accuracy of the smaller model that provided the training labels.
Using Small Models for Pre-training or Fine-Tuning
Combining Small and Large Models
Learn After
Likelihood and Cross-Entropy as Data Filtering Criteria
Weak-to-Strong Generalization via Fine-Tuning on Weak Model Data
Optimizing Training Data for a Medical Language Model
A team is preparing a large, diverse text dataset to train a powerful new language model. To improve the final model's quality, they first use a smaller, pre-existing language model to score each document in the dataset. Documents that receive a very low score from this smaller model are removed. Which of the following documents is most likely to be removed from the dataset during this filtering process?
You are tasked with curating a high-quality dataset for training a large language model. You decide to use a smaller, less powerful model to help filter an initial, large collection of text documents. Arrange the following steps of this data filtering process in the correct logical order.
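One plausible ordering of those steps, as a toy sketch: the scoring function below is a stand-in heuristic (an assumption made for brevity), where a real pipeline would use a small pretrained language model as in the snippet near the top of this page.

```python
# Toy sketch of the filtering pipeline in logical order. The scorer is a
# stand-in heuristic, not a real weak model (an assumption for brevity).

def score_with_weak_model(doc: str) -> float:
    # Proxy for a weak model's average log-likelihood: favor documents
    # built from short, common-looking words.
    words = doc.split()
    return -sum(len(w) for w in words) / max(len(words), 1)

def curate(raw_corpus: list[str], keep_fraction: float = 0.5) -> list[str]:
    # 1. Score every document with the weak model.
    scored = [(score_with_weak_model(d), d) for d in raw_corpus]
    # 2. Choose a cutoff from the score distribution.
    scored.sort(reverse=True)
    n_keep = int(len(scored) * keep_fraction)
    # 3. Discard the low-scoring documents.
    kept = [doc for _, doc in scored[:n_keep]]
    # 4. The curated set then feeds pre-training or fine-tuning of the large model.
    return kept

corpus = ["the cat sat on the mat", "zxqv wpls djfkl qqqqq",
          "dogs run fast", "aslkdjalksdjdfk"]
print(curate(corpus))  # keeps the two natural-language lines
```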