Learn Before
Role and Definition of the Reference Model in RLHF
In RLHF, the reference model serves as the baseline Large Language Model from which policy training starts. It is typically a prior version of the LLM being trained, or a model fine-tuned without human feedback, such as an SFT model. During the policy-training phase, the reference model serves two key functions: it is used for sampling across the range of possible outputs, and it is a component of the loss calculation, regulating how far the policy may drift from the baseline during updates.
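One common way to see this regulating role (a sketch in the usual notation, not this card's own formula): the policy $\pi_\theta$ is trained to maximize the reward while a KL term penalizes divergence from the frozen reference model $\pi_{\text{ref}}$,

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\Big[\, r(x, y) \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big) \Big]$$

where $r(x, y)$ is the reward model's score for response $y$ to prompt $x$, and $\beta$ controls how strongly the policy is kept close to the reference.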
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Objective Function for Policy Learning in RLHF
Use of Proximal Policy Optimization (PPO) in RLHF
Application of A2C in RLHF for LLM Alignment
Role and Definition of the Reference Model in RLHF
Joint Optimization of Policy and Value Functions in RLHF
RLHF Policy Optimization Objective
Reference Policy in RLHF
RLHF Policy Optimization as Loss Minimization
A language model is being fine-tuned using an iterative feedback process. In each step, the model generates a response to a prompt. A separate, pre-trained scoring model then assigns a numerical score to this response based on its quality. What is the most direct and immediate use of this numerical score within a single step of this training loop?
Arrange the following events into the correct chronological order as they would occur within a single iterative step of the policy learning phase for a language model.
Diagnosing a Training Failure in an Iterative Fine-Tuning Process
Direct Preference Optimization (DPO)
Learn After
During the fine-tuning of a language model using a reward signal, a team observes that the model's outputs are becoming nonsensical, even though they receive high reward scores. The model is essentially 'gaming' the reward system. Which component in this training setup is specifically intended to mitigate this issue by penalizing the model for deviating too far from its initial, coherent language patterns?
Diagnosing Training Stagnation in a Reward-Based System
Evaluating Reference Model Selection in Reward-Based Training