
The Direct Preference Optimization (DPO) method is founded on a training objective that conceptually relies on a reward model, `r(x, y)`, to assess the quality of an output `y` for an input `x`. DPO's final formulation does not require an explicitly trained reward model, but the assumed reward function is still the starting point of its derivation.

Conceptual Reward Model in DPO's Training Objective

The mathematical formulation of Direct Preference Optimization (DPO) is derived from a principle that optimizes a language model's policy against a conceptual reward model. Given DPO's final implementation, what is the actual role of this reward model?

The mathematical derivation of the Direct Preference Optimization (DPO) objective function is completely independent of the theoretical concept of a reward model.

A colleague states, "I don't understand why we discuss a reward model when deriving the Direct Preference Optimization (DPO) objective, especially since the final algorithm doesn't require training one." In your own words, clarify the role of the conceptual reward model in the derivation of DPO's training objective and explain why it is not an explicit component of the final implementation.

The Role of the Conceptual Reward Model in DPO

The derivation of the Direct Preference Optimization (DPO) objective starts from an assumed policy training objective in which the quality of an output $$\mathbf{y}$$ for an input $$\mathbf{x}$$ is scored by a theoretical reward model $$r(\mathbf{x}, \mathbf{y})$$. The goal is to find optimal parameters $$\tilde{\theta}$$ by minimizing the sum of a loss term (the negative reward, $$-r(\mathbf{x}, \mathbf{y})$$) and a penalty term, weighted by $$\beta$$, that regularizes the target policy $$\pi_\theta$$ against a reference policy $$\pi_{\theta_{\text{ref}}}$$. In expectation over $$\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})$$, the penalty equals the KL divergence $$\mathrm{KL}(\pi_{\theta} \,\|\, \pi_{\theta_{\text{ref}}})$$, so a larger $$\beta$$ keeps the policy closer to the reference. The assumed training objective is given by:

$$\tilde{\theta} = \arg \min_{\theta} \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \big[ \underbrace{-r(\mathbf{x}, \mathbf{y})}_{\text{loss}} + \beta \underbrace{(\log \pi_{\theta}(\mathbf{y}|\mathbf{x}) - \log \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}))}_{\text{penalty}} \big]$$
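To make the objective above concrete, the following is a minimal sketch that evaluates it exactly for a single fixed input $$\mathbf{x}$$ with only three possible outputs. Everything in it is an illustrative assumption rather than part of DPO itself: the toy probabilities, the reward values, the value of `beta`, and the helper names `objective` and `closed_form_optimum`. The sketch also uses the standard closed-form minimizer of this kind of KL-regularized objective, $$\pi^{*}(\mathbf{y}|\mathbf{x}) \propto \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp(r(\mathbf{x}, \mathbf{y})/\beta)$$, which is not stated in the text above.

```python
import math


def objective(policy, ref, reward, beta):
    """Exact value of E_{y ~ policy}[-r(y) + beta * (log policy(y) - log ref(y))]
    for one fixed input x and a finite set of outputs y."""
    total = 0.0
    for p, q, r in zip(policy, ref, reward):
        if p > 0.0:  # outputs with zero probability contribute nothing
            total += p * (-r + beta * (math.log(p) - math.log(q)))
    return total


def closed_form_optimum(ref, reward, beta):
    """Policy proportional to ref(y) * exp(r(y) / beta), normalized to sum to 1."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, reward)]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical toy setup: three candidate outputs for one input.
ref = [0.5, 0.3, 0.2]      # reference policy probabilities (assumed)
reward = [0.0, 1.0, 2.0]   # conceptual reward r(x, y) per output (assumed)
beta = 0.5                 # penalty weight (assumed)

optimum = closed_form_optimum(ref, reward, beta)
greedy = [0.0, 0.0, 1.0]   # puts all mass on the highest-reward output

value_at_ref = objective(ref, ref, reward, beta)
value_at_greedy = objective(greedy, ref, reward, beta)
value_at_opt = objective(optimum, ref, reward, beta)

# Reward implied by the optimal policy, up to a constant shared by all outputs.
implied = [beta * math.log(p / q) for p, q in zip(optimum, ref)]

print("objective at reference policy:", value_at_ref)
print("objective at greedy policy:   ", value_at_greedy)
print("objective at closed-form opt: ", value_at_opt)
print("implied reward gaps:", implied[1] - implied[0], implied[2] - implied[1])
```

In this toy setting the closed-form policy attains a lower objective value than both the reference policy (no reward gain) and the greedy policy (large penalty). The gaps between the implied rewards $$\beta \log(\pi^{*}/\pi_{\theta_{\text{ref}}})$$ match the gaps between the original rewards, which suggests why a comparison between two outputs can be expressed through policy log-ratios alone, without an explicitly trained reward model.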
