Definition

Reference Policy in DPO's Penalty Term

In the Direct Preference Optimization (DPO) training objective, the penalty term is built from a reference policy, denoted $\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})$. This policy is typically a frozen, supervised fine-tuned version of the language model that serves as a stable baseline. The penalty term regularizes the optimized policy $\pi_\theta$, discouraging it from deviating significantly from the reference, which helps preserve response quality and training stability.
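As a minimal sketch of how the reference policy enters the DPO objective, the snippet below computes the standard per-pair DPO loss, $-\log\sigma\big(\beta[(\log\pi_\theta(y_w|x)-\log\pi_{\theta_{\text{ref}}}(y_w|x))-(\log\pi_\theta(y_l|x)-\log\pi_{\theta_{\text{ref}}}(y_l|x))]\big)$, from scalar sequence log-probabilities. The function name `dpo_loss` and the scalar-input setup are illustrative assumptions, not an implementation from this course.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, scalar log-probs assumed).

    The frozen reference policy's log-probs act as the baseline: the
    loss depends only on how far pi_theta's log-probs have moved away
    from pi_ref's, which is what implicitly penalizes large deviations.
    """
    # Log-ratios of the optimized policy against the fixed reference.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)); guard the exp() for very negative margins.
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin

# When pi_theta still equals pi_ref, both ratios are zero, so the
# loss is -log(0.5) = log(2) ~ 0.6931.
print(round(dpo_loss(-1.0, -2.0, -1.0, -2.0), 4))
```

Note that increasing `beta` sharpens the penalty: the further $\pi_\theta$ drifts from $\pi_{\theta_{\text{ref}}}$ on the rejected response relative to the chosen one, the faster the loss falls, and vice versa.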


Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
