Learn Before
Limitations of Small Model Data Filtering
A team is using a small, pre-trained language model to filter a large, diverse dataset for training a much larger model. They decide to keep only the data points for which the small model shows the lowest cross-entropy. Describe one significant potential risk or drawback of this data curation strategy.
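For concreteness, the filtering rule described above might be sketched as follows. This is only an illustration under stated assumptions: a smoothed unigram model stands in for the small pre-trained language model, and the names `train_unigram`, `cross_entropy`, `keep_lowest`, `reference`, and `candidates` are hypothetical and do not come from the source. A real pipeline would score each data point with the small model's per-token cross-entropy instead.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Build a toy stand-in for the small model: an add-one smoothed unigram model."""
    counts = Counter(tok for text in corpus for tok in text.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens

    def prob(tok):
        return (counts.get(tok, 0) + 1) / (total + vocab)

    return prob


def cross_entropy(prob, text):
    """Mean negative log-probability per token (in nats) of `text` under `prob`."""
    toks = text.split()
    if not toks:
        return math.inf  # empty samples are never preferred
    return -sum(math.log(prob(t)) for t in toks) / len(toks)


def keep_lowest(prob, dataset, keep):
    """Keep the `keep` data points with the lowest cross-entropy."""
    return sorted(dataset, key=lambda t: cross_entropy(prob, t))[:keep]


# Hypothetical data: text the stand-in model was fit on, and candidates to filter.
reference = ["the cat sat on the mat", "the dog sat on the rug"]
candidates = [
    "the cat sat on the mat",
    "quantum chromodynamics lattice gauge",
    "the dog sat",
]

prob = train_unigram(reference)
kept = keep_lowest(prob, candidates, keep=2)
```

Here `kept` holds the two candidates the stand-in model finds most predictable. The cutoff `keep=2` is arbitrary and chosen only for the example.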
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating a Data Curation Strategy for a Specialized Model
A research team is building a large language model specialized in generating high-quality Python code. They have a massive dataset containing a mix of Python code, natural language text, and code from other programming languages. To curate this dataset, they use a smaller, pre-trained model that is already proficient in Python. Which of the following data filtering strategies would be most effective for their goal?
Limitations of Small Model Data Filtering