Prioritizing Influential Data for Fine-Tuning
An effective data selection technique prioritizes the training samples that most influence the fine-tuning process: the dataset is curated down to the most impactful examples, improving both training efficiency and model performance.
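One common way to operationalize "influence" is a TracIn-style score: each candidate training example is scored by the dot product between its loss gradient and the mean gradient over a held-out validation set, and only the top-scoring examples are kept for fine-tuning. The sketch below illustrates this on a toy linear model with squared loss; the model, loss, and function names are illustrative assumptions, not an implementation from this course.

```python
import numpy as np

def per_example_gradients(X, y, w):
    """Gradient of the squared loss 0.5 * (x.w - y)^2 for each example."""
    residuals = X @ w - y             # shape (n,)
    return residuals[:, None] * X     # shape (n, d): one gradient row per example

def influence_scores(X_train, y_train, X_val, y_val, w):
    """Score each training example by how well its gradient aligns with
    the mean validation gradient (a TracIn-style dot product)."""
    g_train = per_example_gradients(X_train, y_train, w)            # (n, d)
    g_val = per_example_gradients(X_val, y_val, w).mean(axis=0)     # (d,)
    return g_train @ g_val  # higher score = training on it reduces val loss more

def select_top_k(X_train, y_train, X_val, y_val, w, k):
    """Return indices of the k most influential training examples."""
    scores = influence_scores(X_train, y_train, X_val, y_val, w)
    return np.argsort(-scores)[:k]

# Toy usage: example 0 matches the validation point exactly, so its
# gradient aligns with the validation gradient and it is selected first.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = np.array([2.0, 2.0, 0.0])
X_val = np.array([[1.0, 0.0]])
y_val = np.array([2.0])
w = np.zeros(2)
print(select_top_k(X_train, y_train, X_val, y_val, w, 1))  # → [0]
```

In a real LLM fine-tuning pipeline the same scoring idea is applied to per-example gradients of the language-modeling loss (usually on a low-rank or last-layer projection to keep the gradients tractable), but the select-by-alignment logic is the same.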
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Small Model-Based Data Selection
Heuristics-Based Data Filtering for Fine-Tuning
Prioritizing Influential Data for Fine-Tuning
A development team fine-tunes a large language model on a massive, newly generated dataset of 1 million instruction-response pairs. After training, they find the model's performance is poor, often generating repetitive, nonsensical, or factually incorrect answers. Which of the following is the most likely root cause of this issue, and what is the best initial strategy to address it?
Evaluating a Data Filtering Strategy
A team is preparing a large, synthetically generated dataset for fine-tuning a language model. They suspect the dataset has several quality issues. Match each potential data quality problem with the primary goal of a filtering method designed to address it.
Deciding Whether (and How) to Use Weak-Model Synthetic Data for Instruction Fine-Tuning
Diagnosing and Fixing a Synthetic Instruction-Tuning Data Flywheel That Degrades Model Behavior
Designing a Synthetic Instruction Fine-Tuning Pipeline Under Budget and Quality Constraints
Stabilizing an Instruction-Tuned Support Assistant When Synthetic Data Conflicts with Human Policy
Selecting and Filtering Self-Generated Instruction Data When Bootstrapping a Strong Model from a Weak Supervisor
Choosing a Weak-Model + Self-Instruct Data Strategy for Instruction Fine-Tuning Without Regressions
Efficiency Benefits of Data Selection in Fine-Tuning
Alpagasus Data Selection System
Learn After
A machine learning team is tasked with fine-tuning a general-purpose language model to specialize in summarizing complex scientific research papers. The team has access to a massive dataset of papers but has a very limited budget for computation, allowing them to use only a small fraction of the available data for training. Their primary objective is to achieve the highest possible summarization quality given these constraints. Which data selection strategy should the team prioritize to most effectively achieve their goal?
Fine-Tuning Efficiency and Performance
Trade-offs in Data Curation for Model Training