Learn Before
Reference Policy in RLHF
In Reinforcement Learning from Human Feedback (RLHF), the reference policy, denoted as π_ref, is a fixed policy used as a baseline during optimization of the active policy π_θ. It is typically a frozen copy of the supervised fine-tuned (SFT) model, taken before the RLHF stage begins. The reference policy's role is to keep the active policy from drifting too far from the original language style and safety behavior; this is enforced by a penalty term (e.g., a KL-divergence penalty) that measures how much the two policies diverge.
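Below is a minimal sketch of how such a penalty is commonly applied in practice: the reward-model score is reduced by β times a per-token estimate of KL(π_θ ‖ π_ref), computed from the log-probabilities the two models assign to the sampled tokens. This assumes a PyTorch-style setup; the function and argument names (`kl_penalized_rewards`, `beta`, etc.) are illustrative, not from any particular RLHF library.

```python
import torch
import torch.nn.functional as F

def kl_penalized_rewards(reward, active_logits, ref_logits, actions, beta=0.1):
    """Subtract a KL-style penalty against the reference policy from the reward.

    reward:        (batch,) score from the reward model for each response
    active_logits: (batch, seq_len, vocab) logits from the policy being trained
    ref_logits:    (batch, seq_len, vocab) logits from the frozen SFT copy
    actions:       (batch, seq_len) token ids actually sampled
    beta:          strength of the divergence penalty (illustrative default)
    """
    active_logp = F.log_softmax(active_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    # log pi_theta(a_t | s_t) - log pi_ref(a_t | s_t) at the sampled tokens:
    # a common single-sample estimator of KL(pi_theta || pi_ref).
    idx = actions.unsqueeze(-1)
    logratio = (active_logp.gather(-1, idx) - ref_logp.gather(-1, idx)).squeeze(-1)

    # Penalized reward: reward-model score minus beta * summed per-token log-ratios.
    return reward - beta * logratio.sum(dim=-1)
```

With β = 0 the penalty vanishes and the policy is free to chase reward-model scores alone, which is exactly the drift the reference policy is meant to prevent; a larger β keeps π_θ closer to the SFT model at the cost of slower reward improvement.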

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Objective Function for Policy Learning in RLHF
Use of Proximal Policy Optimization (PPO) in RLHF
Application of A2C in RLHF for LLM Alignment
Role and Definition of the Reference Model in RLHF
Joint Optimization of Policy and Value Functions in RLHF
RLHF Policy Optimization Objective
RLHF Policy Optimization as Loss Minimization
A language model is being fine-tuned using an iterative feedback process. In each step, the model generates a response to a prompt. A separate, pre-trained scoring model then assigns a numerical score to this response based on its quality. What is the most direct and immediate use of this numerical score within a single step of this training loop?
Arrange the following events into the correct chronological order as they would occur within a single iterative step of the policy learning phase for a language model.
Diagnosing a Training Failure in an Iterative Fine-Tuning Process
Direct Preference Optimization (DPO)
Learn After
RLHF Policy Optimization Objective
Policy Divergence Penalty for Language Models
KL-Divergence Penalty in RLHF Policy Optimization
An AI development team is fine-tuning a language model using a reinforcement learning process guided by a reward model. They observe that the model's outputs, while receiving high scores from the reward model, are becoming stylistically unnatural and deviating significantly from the helpful tone established during its initial supervised training. Which of the following adjustments to the training process is most specifically designed to counteract this behavioral drift?
Diagnosing and Mitigating Reward Hacking
Consequences of Omitting a Reference Policy in RLHF