Short Answer

Limitations of Small Model Data Filtering

A team is using a small, pre-trained language model to filter a large, diverse dataset that will be used to train a much larger model. They decide to keep only the examples to which the small model assigns the lowest cross-entropy. Describe one significant potential risk or drawback of this data curation strategy.
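To make the setup concrete, here is a minimal sketch of the filtering step the question describes. It is an illustration under stated assumptions, not the team's actual pipeline: a toy unigram model with add-one smoothing stands in for the small pre-trained language model, and the names `train_unigram`, `cross_entropy`, `keep_lowest_cross_entropy`, the sample texts, and the `keep_fraction` parameter are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Toy stand-in for the small model: unigram LM with add-one smoothing."""
    counts = Counter(tok for text in corpus for tok in text.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra bucket for unseen tokens

    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab_size)

    return prob


def cross_entropy(prob, text):
    """Average negative log-probability per token (in nats)."""
    tokens = text.split()
    if not tokens:
        return float("inf")  # treat empty text as maximally unpredictable
    return -sum(math.log(prob(t)) for t in tokens) / len(tokens)


def keep_lowest_cross_entropy(prob, dataset, keep_fraction):
    """Keep the fraction of examples the scoring model finds most predictable."""
    ranked = sorted(dataset, key=lambda text: cross_entropy(prob, text))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical data: what the small model was trained on, and the pool to filter.
reference_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
candidates = [
    "the cat sat on the rug",
    "the the the the",
    "quantum annealing minimizes ising hamiltonians",
    "the dog sat on the mat",
]

small_model = train_unigram(reference_corpus)
kept = keep_lowest_cross_entropy(small_model, candidates, keep_fraction=0.5)
print(kept)
```

Printing `kept` shows which candidates survive the filter; comparing them with the discarded ones is one way to start reasoning about the question.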


Updated 2025-10-10


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science