Learn Before
Data Selection and Filtering using Small Models
One method for curating training datasets for Large Language Models uses a smaller, auxiliary model to score each candidate sequence. The small model computes metrics such as likelihood or cross-entropy for every sequence, and these scores drive selection: sequences poorly aligned with the small model's learned distribution (e.g., low likelihood or high cross-entropy) can be filtered out, while well-aligned sequences (e.g., high likelihood or low cross-entropy) can be prioritized. This focuses the larger model's training on more relevant, higher-quality data.
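As a minimal sketch of the idea, the "small model" below is a toy character-level unigram model fit on in-domain reference text (in practice it would be a small pre-trained language model); the function names and the cross-entropy threshold are illustrative assumptions, not part of any specific system:

```python
import math
from collections import Counter

def train_unigram(reference_texts):
    """Stand-in 'small model': a smoothed character-level unigram
    distribution fit on in-domain reference text."""
    counts = Counter()
    for text in reference_texts:
        counts.update(text)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen symbols (add-one smoothing)
    probs = {ch: (c + 1) / (total + vocab) for ch, c in counts.items()}
    unseen_prob = 1 / (total + vocab)
    return probs, unseen_prob

def cross_entropy(text, probs, unseen_prob):
    """Average negative log-likelihood (nats per character) of a sequence
    under the small model; higher means worse alignment with its distribution."""
    if not text:
        return float("inf")
    nll = -sum(math.log(probs.get(ch, unseen_prob)) for ch in text)
    return nll / len(text)

def filter_sequences(candidates, probs, unseen_prob, max_ce):
    """Keep sequences the small model assigns low cross-entropy (well aligned);
    drop those with cross-entropy above the threshold (poorly aligned)."""
    return [s for s in candidates
            if cross_entropy(s, probs, unseen_prob) <= max_ce]
```

For example, a model fit on Python snippets would retain new code-like sequences and drop symbol noise, since the noise falls almost entirely on unseen characters and so receives high cross-entropy; the threshold `max_ce` would be tuned empirically in a real pipeline.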
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Classification of LLM Adaptation Methods
RLHF Policy Optimization as Loss Minimization
A development team is fine-tuning a large language model for a specific task using a dataset of inputs and corresponding correct outputs. During a training iteration, the model produces an output that is very different from the correct target output. What is the immediate, primary function of this discrepancy within the training process?
Direct Supervision via Knowledge Distillation Loss in Weak-to-Strong Generalization
A large language model is undergoing a single step of fine-tuning on a new dataset. Arrange the following events in the correct chronological order to represent this process.
Data Selection and Filtering using Small Models
Diagnosing a Stagnant Fine-Tuning Process
Learn After
Evaluating a Data Curation Strategy for a Specialized Model
A research team is building a large language model specialized in generating high-quality Python code. They have a massive dataset containing a mix of Python code, natural language text, and code from other programming languages. To curate this dataset, they use a smaller, pre-trained model that is already proficient in Python. Which of the following data filtering strategies would be most effective for their goal?
Limitations of Small Model Data Filtering