Learn Before
Reference Policy in RLHF
In Reinforcement Learning from Human Feedback (RLHF), the reference policy, denoted as π_ref, is a fixed policy used as a baseline during optimization of the active policy π_θ. It is typically a frozen copy of the supervised fine-tuned (SFT) model, taken before the RLHF stage begins. The reference policy's role is to keep the active policy from drifting too far from the original language style and safety behavior; this is enforced by a penalty term (e.g., a KL-divergence penalty) that measures how much the two policies diverge.
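Below is a minimal sketch of how such a penalty is commonly applied in practice: the reward-model score is reduced by β times a per-token estimate of KL(π_θ ‖ π_ref), computed from the log-probabilities the two models assign to the sampled tokens. This assumes a PyTorch-style setup; the function and argument names (`kl_penalized_rewards`, `beta`, etc.) are illustrative, not from any particular RLHF library.

```python
import torch
import torch.nn.functional as F

def kl_penalized_rewards(reward, active_logits, ref_logits, actions, beta=0.1):
    """Subtract a KL-style penalty against the reference policy from the reward.

    reward:        (batch,) score from the reward model for each response
    active_logits: (batch, seq_len, vocab) logits from the policy being trained
    ref_logits:    (batch, seq_len, vocab) logits from the frozen SFT copy
    actions:       (batch, seq_len) token ids actually sampled
    beta:          strength of the divergence penalty (illustrative default)
    """
    active_logp = F.log_softmax(active_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    # log pi_theta(a_t | s_t) - log pi_ref(a_t | s_t) at the sampled tokens:
    # a common single-sample estimator of KL(pi_theta || pi_ref).
    idx = actions.unsqueeze(-1)
    logratio = (active_logp.gather(-1, idx) - ref_logp.gather(-1, idx)).squeeze(-1)

    # Penalized reward: reward-model score minus beta * summed per-token log-ratios.
    return reward - beta * logratio.sum(dim=-1)
```

With β = 0 the penalty vanishes and the policy is free to chase reward-model scores alone, which is exactly the drift the reference policy is meant to prevent; a larger β keeps π_θ closer to the SFT model at the cost of slower reward improvement.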

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Objective Function for Policy Learning in RLHF
Use of Proximal Policy Optimization (PPO) in RLHF
Application of A2C in RLHF for LLM Alignment
Role and Definition of the Reference Model in RLHF
Joint Optimization of Policy and Value Functions in RLHF
RLHF Policy Optimization Objective
RLHF Policy Optimization as Loss Minimization
A language model is being fine-tuned using an iterative feedback process. In each step, the model generates a response to a prompt. A separate, pre-trained scoring model then assigns a numerical score to this response based on its quality. What is the most direct and immediate use of this numerical score within a single step of this training loop?
Arrange the following events into the correct chronological order as they would occur within a single iterative step of the policy learning phase for a language model.
Diagnosing a Training Failure in an Iterative Fine-Tuning Process
Direct Preference Optimization (DPO)
Learn After
RLHF Policy Optimization Objective
Policy Divergence Penalty for Language Models
KL-Divergence Penalty in RLHF Policy Optimization
An AI development team is fine-tuning a language model using a reinforcement learning process guided by a reward model. They observe that the model's outputs, while receiving high scores from the reward model, are becoming stylistically unnatural and deviating significantly from the helpful tone established during its initial supervised training. Which of the following adjustments to the training process is most specifically designed to counteract this behavioral drift?
Diagnosing and Mitigating Reward Hacking
Consequences of Omitting a Reference Policy in RLHF