Definition

Reference Policy in RLHF

In Reinforcement Learning from Human Feedback (RLHF), the reference policy, denoted $\pi_{\theta_{\text{ref}}}(\mathbf{y}\mid\mathbf{x})$, is a fixed policy used as a baseline during the optimization of the active policy $\pi_{\theta}$. It is typically a copy of the supervised fine-tuned (SFT) model before the RLHF stage begins. The reference policy's role is to prevent the active policy from deviating too far from the original language style and safety constraints, which is enforced by a penalty term (e.g., KL-divergence) that measures the difference between the two policies.
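A minimal sketch of how such a penalty is commonly applied: the scalar reward is reduced by a per-token KL estimate, $\log \pi_{\theta}(\mathbf{y}\mid\mathbf{x}) - \log \pi_{\theta_{\text{ref}}}(\mathbf{y}\mid\mathbf{x})$, scaled by a coefficient (here called `beta`; the function name and signature are illustrative, not from a specific library).

```python
import numpy as np

def kl_penalized_reward(reward, logp_active, logp_ref, beta=0.1):
    """Subtract a KL penalty from the scalar reward.

    logp_active / logp_ref are per-token log-probabilities of the
    sampled response y under the active and reference policies.
    The per-token KL estimate is their difference; beta controls
    how strongly the active policy is tied to the reference.
    """
    kl = np.asarray(logp_active) - np.asarray(logp_ref)
    return reward - beta * kl.sum()

# The active policy assigns higher probability to y than the
# reference does, so the effective reward is reduced.
adjusted = kl_penalized_reward(
    reward=1.0,
    logp_active=[-0.5, -1.0],  # hypothetical per-token log-probs
    logp_ref=[-0.7, -1.4],
    beta=0.1,
)
```

With a larger `beta`, deviations from the reference policy are punished more heavily, keeping the optimized model closer to the SFT model's style.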

Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences