Learn Before
Likelihood and Cross-Entropy as Data Filtering Criteria
When a weak model is used for data selection, the likelihood or cross-entropy it assigns to a sequence serves as a quantitative filtering criterion. Sequences that are poorly aligned with the weak model's learned distribution, indicated by low likelihood or high cross-entropy, may be excluded from the training set. Conversely, sequences that align well, indicated by high likelihood or low cross-entropy, can be prioritized, focusing the large model's training on more relevant, higher-quality data.
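As an illustration, here is a minimal sketch of this filtering step. It assumes a Hugging Face causal language model (gpt2 stands in for the weak scoring model), a hypothetical list of documents, and an illustrative threshold; the mean per-token cross-entropy returned as the model's loss is used as the score, and documents above the threshold are dropped.

```python
# Sketch: score documents with a small "weak" model's per-token cross-entropy
# and keep only those below a chosen threshold.
# Model name, threshold, and documents are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

WEAK_MODEL = "gpt2"              # stand-in for the weak scoring model
CE_THRESHOLD = 5.0               # nats per token; tuned on held-out data in practice

tokenizer = AutoTokenizer.from_pretrained(WEAK_MODEL)
model = AutoModelForCausalLM.from_pretrained(WEAK_MODEL)
model.eval()

def per_token_cross_entropy(text: str) -> float:
    """Mean per-token cross-entropy of `text` under the weak model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # With labels equal to input_ids, the model's loss is the mean
        # next-token cross-entropy over the sequence.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()

documents = [
    "The mitochondrion is the membrane-bound organelle that produces ATP.",
    "asdf qwerty 12399 zzzz ???? ipsum",   # noisy text tends to score poorly
]

# Keep low-cross-entropy (high-likelihood) documents; drop the rest.
kept = [d for d in documents if per_token_cross_entropy(d) < CE_THRESHOLD]
print(f"kept {len(kept)} of {len(documents)} documents")
```

Using the length-normalized (per-token) cross-entropy rather than the total sequence cross-entropy avoids systematically penalizing longer documents.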
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Likelihood and Cross-Entropy as Data Filtering Criteria
Weak-to-Strong Generalization via Fine-Tuning on Weak Model Data
Optimizing Training Data for a Medical Language Model
A team is preparing a large, diverse text dataset to train a powerful new language model. To improve the final model's quality, they first use a smaller, pre-existing language model to score each document in the dataset. Documents that receive a very low score from this smaller model are removed. Which of the following documents is most likely to be removed from the dataset during this filtering process?
You are tasked with curating a high-quality dataset for training a large language model. You decide to use a smaller, less powerful model to help filter an initial, large collection of text documents. Arrange the following steps of this data filtering process in the correct logical order.
Learn After
Data Filtering for a Specialized Language Model
A team is refining a large, general-purpose text corpus to train a specialized language model. They use a smaller, pre-existing model to calculate the cross-entropy for each document. Their goal is to create a high-quality, coherent, and well-structured training set. Which of the following filtering strategies should they implement and why?
Interpreting Cross-Entropy for Data Curation