Learn Before
Likelihood and Cross-Entropy as Data Filtering Criteria
When a weak model is used for data selection, the likelihood or cross-entropy it assigns to a sequence serves as a quantitative filtering criterion. Sequences that are poorly aligned with the weak model's learned distribution, indicated by low likelihood or high cross-entropy, may be excluded from the training set. Conversely, sequences that align well, indicated by high likelihood or low cross-entropy, can be prioritized, focusing the large model's training on more relevant, higher-quality data.
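As an illustration, here is a minimal sketch of this filtering step. It assumes a Hugging Face causal language model (gpt2 stands in for the weak scoring model), a hypothetical list of documents, and an illustrative threshold; the mean per-token cross-entropy returned as the model's loss is used as the score, and documents above the threshold are dropped.

```python
# Sketch: score documents with a small "weak" model's per-token cross-entropy
# and keep only those below a chosen threshold.
# Model name, threshold, and documents are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

WEAK_MODEL = "gpt2"              # stand-in for the weak scoring model
CE_THRESHOLD = 5.0               # nats per token; tuned on held-out data in practice

tokenizer = AutoTokenizer.from_pretrained(WEAK_MODEL)
model = AutoModelForCausalLM.from_pretrained(WEAK_MODEL)
model.eval()

def per_token_cross_entropy(text: str) -> float:
    """Mean per-token cross-entropy of `text` under the weak model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # With labels equal to input_ids, the model's loss is the mean
        # next-token cross-entropy over the sequence.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()

documents = [
    "The mitochondrion is the membrane-bound organelle that produces ATP.",
    "asdf qwerty 12399 zzzz ???? ipsum",   # noisy text tends to score poorly
]

# Keep low-cross-entropy (high-likelihood) documents; drop the rest.
kept = [d for d in documents if per_token_cross_entropy(d) < CE_THRESHOLD]
print(f"kept {len(kept)} of {len(documents)} documents")
```

Using the length-normalized (per-token) cross-entropy rather than the total sequence cross-entropy avoids systematically penalizing longer documents.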
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Likelihood and Cross-Entropy as Data Filtering Criteria
Weak-to-Strong Generalization via Fine-Tuning on Weak Model Data
Optimizing Training Data for a Medical Language Model
A team is preparing a large, diverse text dataset to train a powerful new language model. To improve the final model's quality, they first use a smaller, pre-existing language model to score each document in the dataset. Documents that receive a very low score from this smaller model are removed. Which of the following documents is most likely to be removed from the dataset during this filtering process?
You are tasked with curating a high-quality dataset for training a large language model. You decide to use a smaller, less powerful model to help filter an initial, large collection of text documents. Arrange the following steps of this data filtering process in the correct logical order.
Learn After
Data Filtering for a Specialized Language Model
A team is refining a large, general-purpose text corpus to train a specialized language model. They use a smaller, pre-existing model to calculate the cross-entropy for each document. Their goal is to create a high-quality, coherent, and well-structured training set. Which of the following filtering strategies should they implement and why?
Interpreting Cross-Entropy for Data Curation