Learn Before
Data Selection and Filtering using Small Models
One method for curating training datasets for Large Language Models uses a smaller, auxiliary model to score each candidate sequence. The small model computes metrics such as likelihood or cross-entropy for every sequence, and these scores drive selection: sequences poorly aligned with the small model's learned distribution (e.g., low likelihood or high cross-entropy) can be filtered out, while well-aligned sequences (e.g., high likelihood or low cross-entropy) can be prioritized. This focuses the larger model's training on more relevant, higher-quality data.
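As a minimal sketch of the idea, the "small model" below is a toy character-level unigram model fit on in-domain reference text (in practice it would be a small pre-trained language model); the function names and the cross-entropy threshold are illustrative assumptions, not part of any specific system:

```python
import math
from collections import Counter

def train_unigram(reference_texts):
    """Stand-in 'small model': a smoothed character-level unigram
    distribution fit on in-domain reference text."""
    counts = Counter()
    for text in reference_texts:
        counts.update(text)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen symbols (add-one smoothing)
    probs = {ch: (c + 1) / (total + vocab) for ch, c in counts.items()}
    unseen_prob = 1 / (total + vocab)
    return probs, unseen_prob

def cross_entropy(text, probs, unseen_prob):
    """Average negative log-likelihood (nats per character) of a sequence
    under the small model; higher means worse alignment with its distribution."""
    if not text:
        return float("inf")
    nll = -sum(math.log(probs.get(ch, unseen_prob)) for ch in text)
    return nll / len(text)

def filter_sequences(candidates, probs, unseen_prob, max_ce):
    """Keep sequences the small model assigns low cross-entropy (well aligned);
    drop those with cross-entropy above the threshold (poorly aligned)."""
    return [s for s in candidates
            if cross_entropy(s, probs, unseen_prob) <= max_ce]
```

For example, a model fit on Python snippets would retain new code-like sequences and drop symbol noise, since the noise falls almost entirely on unseen characters and so receives high cross-entropy; the threshold `max_ce` would be tuned empirically in a real pipeline.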
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Classification of LLM Adaptation Methods
RLHF Policy Optimization as Loss Minimization
A development team is fine-tuning a large language model for a specific task using a dataset of inputs and corresponding correct outputs. During a training iteration, the model produces an output that is very different from the correct target output. What is the immediate, primary function of this discrepancy within the training process?
Direct Supervision via Knowledge Distillation Loss in Weak-to-Strong Generalization
A large language model is undergoing a single step of fine-tuning on a new dataset. Arrange the following events in the correct chronological order to represent this process.
Data Selection and Filtering using Small Models
Diagnosing a Stagnant Fine-Tuning Process
Learn After
Evaluating a Data Curation Strategy for a Specialized Model
A research team is building a large language model specialized in generating high-quality Python code. They have a massive dataset containing a mix of Python code, natural language text, and code from other programming languages. To curate this dataset, they use a smaller, pre-trained model that is already proficient in Python. Which of the following data filtering strategies would be most effective for their goal?
Limitations of Small Model Data Filtering