Learn Before
Rejection Sampling for LLM Fine-Tuning
Rejection sampling is a technique for fine-tuning Large Language Models by incorporating human preferences. For each prompt, the process generates N candidate outputs (an N-best list), uses a reward model to identify the highest-quality responses among them, and then uses this curated set of 'best' outputs as the data for fine-tuning the LLM.
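As a concrete illustration, here is a minimal sketch of the curation loop described above. The generator and reward model are toy placeholders (assumptions for illustration), not a real LLM or a trained reward model:

```python
# Minimal sketch of rejection sampling for fine-tuning data curation.
# generate_candidates and reward_score are hypothetical placeholders.
import random

def generate_candidates(prompt: str, n: int) -> list[str]:
    # Placeholder: a real system samples n responses from the LLM,
    # typically with temperature > 0 so the candidates differ.
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def reward_score(prompt: str, response: str) -> float:
    # Placeholder: a real reward model scores (prompt, response) quality.
    return random.random()

def curate_dataset(prompts: list[str], n: int = 8) -> list[tuple[str, str]]:
    dataset = []
    for prompt in prompts:
        candidates = generate_candidates(prompt, n)                    # 1. sample N outputs
        best = max(candidates, key=lambda c: reward_score(prompt, c))  # 2. score and select the best
        dataset.append((prompt, best))                                 # 3. keep only the winner
    return dataset

if __name__ == "__main__":
    # 4. The curated (prompt, best response) pairs become supervised
    #    fine-tuning data for the original LLM (training step not shown).
    print(curate_dataset(["How do I stay safe online?"]))
```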
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Input and Output Formulation in BoN Sampling
Generating N-Best Candidates in BoN Sampling
Reward Model Selection in BoN Sampling
Rejection Sampling for LLM Fine-Tuning
A company wants to improve the safety and helpfulness of its AI assistant without the high cost and time of retraining the entire base model. They propose a new system for handling user queries: for each query, the system will first generate 10 different potential responses. Then, a separate, fast-acting 'quality-scoring' model will evaluate all 10 responses based on pre-defined criteria. Finally, the system will present to the user only the single response that received the highest score (a sketch of this selection loop appears after this list). What is the most significant trade-off of this approach compared to simply using the first response the base model generates?
A system is designed to improve the quality of its generated text by producing multiple options and then picking the best one. Arrange the following steps of this process in the correct logical order.
Chatbot Response Quality Improvement
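The chatbot scenario in the list above describes best-of-N selection at inference time rather than fine-tuning. A minimal sketch of that selection loop, with hypothetical placeholder functions standing in for the base model and the quality-scoring model; the comment marks the compute trade-off the question asks about:

```python
# Minimal sketch of inference-time best-of-N selection, as in the chatbot
# scenario above. generate_response and quality_score are hypothetical
# placeholders, not a real serving stack or scoring model.
import random

def generate_response(query: str) -> str:
    # Placeholder for one sampled response from the base model.
    return f"draft answer ({random.random():.3f}) to: {query}"

def quality_score(query: str, response: str) -> float:
    # Placeholder for the fast quality-scoring model.
    return random.random()

def answer(query: str, n: int = 10) -> str:
    # Trade-off: roughly n times the generation compute per query (and
    # added latency, unless the candidates are sampled in parallel).
    candidates = [generate_response(query) for _ in range(n)]
    return max(candidates, key=lambda r: quality_score(query, r))

print(answer("How do I reset my password safely?"))
```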
Learn After
Comparison of Rejection Sampling and RLHF
Adoption of Rejection Sampling in LLMs
Analyzing a Flawed Model Improvement Pipeline
You are tasked with improving a language model's ability to generate helpful and harmless responses. You decide to use a method that involves generating multiple potential responses to a prompt, scoring them with a separate quality-assessment model, and then using only the best-scoring responses to further train the original model. Arrange the following steps of this process in the correct logical order.
A machine learning team wants to improve a base language model's ability to follow instructions. They have already trained a separate, reliable 'reward model' that can score the quality of any given response. The team wants to use this reward model to enhance the base model's performance directly through a data-centric approach, avoiding more complex training paradigms. Which of the following strategies correctly describes the most effective and direct way to use the reward model for this purpose?
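The two questions above resolve to the same pipeline: curate best-of-N responses with the reward model, then run ordinary supervised fine-tuning on them. Below is a hedged sketch of that final training step using the Hugging Face transformers and datasets libraries; the model name ('gpt2'), hyperparameters, and the example pair are illustrative assumptions, not values from the course.

```python
# Hedged sketch of supervised fine-tuning on rejection-sampled data.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Curated best-of-N (prompt, response) pairs; contents are illustrative.
pairs = [("How do I stay safe online?",
          "Use strong, unique passwords and enable two-factor authentication.")]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(example):
    # Train on prompt + best response with the ordinary causal-LM objective.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

ds = Dataset.from_list(
    [{"prompt": p, "response": r} for p, r in pairs]
).map(tokenize, remove_columns=["prompt", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rs-sft",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # fine-tune the base model on only the highest-scoring responses
```

Note the data-centric design the last question points toward: the reward model never updates the LLM's weights directly; it only filters which examples the LLM is fine-tuned on.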