Definition

Reference Policy in DPO's Penalty Term

In the Direct Preference Optimization (DPO) training objective, the penalty term is built from a reference policy, denoted $\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})$. This policy is typically a frozen, supervised fine-tuned version of the language model that serves as a stable baseline. The penalty term regularizes the optimized policy $\pi_\theta$, discouraging it from deviating significantly from the reference, which helps preserve response quality and training stability.
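As a minimal sketch of how the reference policy enters the DPO objective, the snippet below computes the standard per-pair DPO loss, $-\log\sigma\big(\beta[(\log\pi_\theta(y_w|x)-\log\pi_{\theta_{\text{ref}}}(y_w|x))-(\log\pi_\theta(y_l|x)-\log\pi_{\theta_{\text{ref}}}(y_l|x))]\big)$, from scalar sequence log-probabilities. The function name `dpo_loss` and the scalar-input setup are illustrative assumptions, not an implementation from this course.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, scalar log-probs assumed).

    The frozen reference policy's log-probs act as the baseline: the
    loss depends only on how far pi_theta's log-probs have moved away
    from pi_ref's, which is what implicitly penalizes large deviations.
    """
    # Log-ratios of the optimized policy against the fixed reference.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)); guard the exp() for very negative margins.
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin

# When pi_theta still equals pi_ref, both ratios are zero, so the
# loss is -log(0.5) = log(2) ~ 0.6931.
print(round(dpo_loss(-1.0, -2.0, -1.0, -2.0), 4))
```

Note that increasing `beta` sharpens the penalty: the further $\pi_\theta$ drifts from $\pi_{\theta_{\text{ref}}}$ on the rejected response relative to the chosen one, the faster the loss falls, and vice versa.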


Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
