Data Selection and Filtering Using Weak Models
One method for curating training data for large language models leverages a smaller, weaker model: the weak model computes a metric such as likelihood or cross-entropy for each data sequence, and these scores then serve as selection criteria, filtering out less suitable data or prioritizing high-quality data for pre-training or fine-tuning.
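As a concrete illustration, the sketch below scores each document by its mean per-token cross-entropy under a small pretrained model and keeps only documents below a threshold. This is a minimal sketch assuming a Hugging Face causal language model; the model choice (gpt2), the threshold value, and the toy corpus are illustrative assumptions, not settings prescribed by the method.

```python
# Minimal sketch: likelihood-based data filtering with a small "weak" model.
# Assumptions: Hugging Face transformers, gpt2 as the weak model, and an
# arbitrary cross-entropy cutoff; none of these are prescribed values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # small weak model (illustrative choice)
MAX_LOSS = 4.0        # mean per-token cross-entropy cutoff, in nats (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def sequence_cross_entropy(text: str) -> float:
    """Mean per-token cross-entropy of `text` under the weak model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    # Passing labels=input_ids makes the model return the average
    # next-token prediction loss over the sequence.
    return model(ids, labels=ids).loss.item()

def filter_corpus(docs: list[str]) -> list[str]:
    """Keep documents the weak model finds sufficiently predictable."""
    return [d for d in docs if sequence_cross_entropy(d) <= MAX_LOSS]

corpus = ["The cat sat on the mat.", "xjq zzv 9#@!! qwop vrk"]
print(filter_corpus(corpus))  # the gibberish line typically scores far higher loss
```

The same scores can also be used in the opposite direction, up-weighting or prioritizing the highest-likelihood sequences rather than hard-filtering the lowest.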
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Example of Successful Weak-to-Strong Generalization: GPT-4 with GPT-2 Supervision
Weak Performance (P_weak) as a Baseline Metric
Weak-to-Strong Performance (P_weak→strong)
Strong Ceiling Performance (P_ceiling)
Performance Gap Recovered (PGR)
Cascading Inference
Weak-to-Strong Generalization via Fine-Tuning on Weak Model Data
AI System Optimization Strategy
An AI development team is building a system to answer a very high volume of customer support queries. They implement a two-step process: first, a small, fast model attempts to answer each query. If this model's confidence in its answer is low, the query is then passed to a much larger, more powerful, but slower model. What is the most significant strategic advantage of this architectural choice?
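The cascade in this scenario can be sketched as follows. The Answer type, the confidence measure, and the 0.8 threshold are illustrative assumptions; the two models are passed in as plain callables rather than any specific API.

```python
# Minimal sketch of cascading inference: a cheap model answers first, and
# the query escalates to an expensive model only on low confidence.
# The confidence measure and threshold are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Answer:
    text: str
    confidence: float  # e.g., mean token probability, in [0, 1]

CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff

def cascade(query: str,
            small_model: Callable[[str], Answer],
            large_model: Callable[[str], Answer]) -> Answer:
    first = small_model(query)
    if first.confidence >= CONFIDENCE_THRESHOLD:
        return first           # most queries stop here, keeping latency and cost low
    return large_model(query)  # only hard cases pay for the slow, powerful model
```

Because only low-confidence queries ever reach the large model, average cost and latency track the small model, while the fallback preserves answer quality on the hard cases.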
Direct Supervision via Knowledge Distillation Loss in Weak-to-Strong Generalization
When a large, powerful computational model is trained using labels generated exclusively by a smaller, less accurate model, the large model's performance on new, unseen data is fundamentally limited: it cannot exceed the accuracy of the smaller model that provided the training labels.
Using Small Models for Pre-training or Fine-Tuning
Combining Small and Large Models
Learn After
Likelihood and Cross-Entropy as Data Filtering Criteria
Weak-to-Strong Generalization via Fine-Tuning on Weak Model Data
Optimizing Training Data for a Medical Language Model
A team is preparing a large, diverse text dataset to train a powerful new language model. To improve the final model's quality, they first use a smaller, pre-existing language model to score each document in the dataset. Documents that receive a very low score from this smaller model are removed. Which of the following documents is most likely to be removed from the dataset during this filtering process?
You are tasked with curating a high-quality dataset for training a large language model. You decide to use a smaller, less powerful model to help filter an initial, large collection of text documents. Arrange the following steps of this data filtering process in the correct logical order.
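One plausible ordering of those steps, as a toy sketch: the scoring function below is a stand-in heuristic (an assumption made for brevity), where a real pipeline would use a small pretrained language model as in the snippet near the top of this page.

```python
# Toy sketch of the filtering pipeline in logical order. The scorer is a
# stand-in heuristic, not a real weak model (an assumption for brevity).

def score_with_weak_model(doc: str) -> float:
    # Proxy for a weak model's average log-likelihood: favor documents
    # built from short, common-looking words.
    words = doc.split()
    return -sum(len(w) for w in words) / max(len(words), 1)

def curate(raw_corpus: list[str], keep_fraction: float = 0.5) -> list[str]:
    # 1. Score every document with the weak model.
    scored = [(score_with_weak_model(d), d) for d in raw_corpus]
    # 2. Choose a cutoff from the score distribution.
    scored.sort(reverse=True)
    n_keep = int(len(scored) * keep_fraction)
    # 3. Discard the low-scoring documents.
    kept = [doc for _, doc in scored[:n_keep]]
    # 4. The curated set then feeds pre-training or fine-tuning of the large model.
    return kept

corpus = ["the cat sat on the mat", "zxqv wpls djfkl qqqqq",
          "dogs run fast", "aslkdjalksdjdfk"]
print(curate(corpus))  # keeps the two natural-language lines
```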