Filtering in Self-Instruct
In the Self-Instruct framework, each newly generated sample is checked against heuristic rules before it is accepted. A key heuristic filters out samples or instructions that are too similar to those already in the task pool. Samples that pass this check are added to the pool, which helps keep the dataset high in quality and free of near-duplicates.
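The similarity check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ROUGE-L-style word-overlap score, the 0.7 threshold, and the names `lcs_length`, `similarity`, and `filter_and_add` are hypothetical choices made for this example and are not specified in the text.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """ROUGE-L-style F1 over lowercase whitespace tokens (assumed metric)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(tb), lcs / len(ta)
    return 2 * precision * recall / (precision + recall)


def filter_and_add(candidate, pool, threshold=0.7):
    """Add candidate to pool unless it is too similar to a pooled instruction.

    The threshold is an illustrative assumption, not a value from the text.
    """
    if any(similarity(candidate, existing) >= threshold for existing in pool):
        return False  # rejected: near-duplicate of something already in the pool
    pool.append(candidate)
    return True


pool = ["Summarize the provided text into a single sentence."]
# Near-duplicate of the pooled instruction: expected to be filtered out.
accepted_dup = filter_and_add("Summarize the provided text in a single sentence.", pool)
# Unrelated instruction: expected to pass and join the pool.
accepted_new = filter_and_add("Write a product description for a fictional gadget.", pool)
```

Comparing each candidate against the whole pool makes the check grow with pool size, which is acceptable for a sketch; a real pipeline would likely combine this with other heuristics, such as length or keyword checks.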

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Sample Generation in Self-Instruct
Filtering in Self-Instruct
Task Pool in Self-Instruct
Initialization of the Task Pool in Self-Instruct
Instruction Generation in Self-Instruct
Refining Prompt Templates in Self-Instruct
An AI development team wants to expand a small, manually-created set of instruction-following data into a much larger dataset for fine-tuning a language model. They decide to use the model itself to generate new data in an iterative loop. Which of the following procedures correctly describes the core cycle for generating one new, high-quality data point?
A team is using an iterative method to generate a large dataset for fine-tuning a language model, starting from a small set of examples. Arrange the core steps of a single cycle of this process in the correct order.
Diagnosing a Data Generation Pipeline Issue
In an automated process for generating training data, a language model has just created a new, unique instruction: 'Write a product description for a fictional gadget.' To complete the data instance for this instruction, what is the essential next task for the model?
Example of a Prompt Template for Sample Generation in Self-Instruct
An automated system for creating training data has just generated a new instruction: 'Summarize the provided text into a single sentence.' In the subsequent step, the system produces the following text: 'The main character overcomes several obstacles to achieve their lifelong dream.' Based on the requirements for creating a complete data instance, what crucial component is missing from this generated sample?
Diagnosing a Flaw in an Automated Data Generation Process
Learn After
A team uses an iterative process to automatically generate a large instruction-tuning dataset, starting from a small set of initial examples. After fine-tuning, the resulting model performs very well on tasks that are nearly identical to the initial examples but fails to generalize to new, unseen types of instructions. What is the most probable deficiency in the data generation pipeline that led to this outcome?
A team is using an iterative method to expand a small set of seed instructions into a large dataset for model training. Arrange the following steps of a single generation cycle in the correct chronological order.
Evaluating a Self-Instruct Filtering Strategy