Formula

Conceptual Objective Function Assumed in DPO

Before explicitly deriving the Direct Preference Optimization (DPO) objective, the method conceptually assumes a foundational policy training objective in which the quality of an output $\mathbf{y}$ given an input $\mathbf{x}$ is evaluated by a theoretical reward model $r(\mathbf{x}, \mathbf{y})$. The goal is to find optimal parameters $\tilde{\theta}$ by minimizing the expectation of a loss term (the negative reward, $-r(\mathbf{x}, \mathbf{y})$) plus a penalty term that regularizes the target policy $\pi_{\theta}$ against a reference policy $\pi_{\theta_{\text{ref}}}$. The assumed training objective is given by:

$$\tilde{\theta} = \arg\min_{\theta} \; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot \mid \mathbf{x})} \Big[ \underbrace{-r(\mathbf{x}, \mathbf{y})}_{\text{loss}} + \beta \underbrace{\big(\log \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) - \log \pi_{\theta_{\text{ref}}}(\mathbf{y} \mid \mathbf{x})\big)}_{\text{penalty}} \Big]$$
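Note that the expectation of the penalty term over $\mathbf{y} \sim \pi_{\theta}(\cdot \mid \mathbf{x})$ is exactly the KL divergence $\mathrm{KL}\big(\pi_{\theta}(\cdot \mid \mathbf{x}) \,\|\, \pi_{\theta_{\text{ref}}}(\cdot \mid \mathbf{x})\big)$, so $\beta$ controls how far the trained policy may drift from the reference. To make the bracketed quantity concrete, below is a minimal PyTorch-style sketch; the function name, its signature, and the example numbers are illustrative assumptions rather than part of the source. It only evaluates the per-sample term $-r(\mathbf{x}, \mathbf{y}) + \beta(\log \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) - \log \pi_{\theta_{\text{ref}}}(\mathbf{y} \mid \mathbf{x}))$ for a given response; optimizing the full expectation over $\mathbf{y} \sim \pi_{\theta}$ would still require sampling and policy-gradient machinery, which is the step the DPO derivation is designed to avoid.

```python
import torch

def regularized_reward_loss(logp_policy: torch.Tensor,
                            logp_ref: torch.Tensor,
                            reward: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
    """Per-sample value of the bracketed objective (a sketch, not the DPO loss):
        -r(x, y) + beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    logp_policy: log pi_theta(y|x), summed over the tokens of y (tracks gradients)
    logp_ref:    log pi_ref(y|x), summed over the tokens of y (treated as a constant)
    reward:      scalar reward r(x, y) from the assumed reward model
    beta:        regularization strength (illustrative default)
    """
    penalty = logp_policy - logp_ref.detach()  # per-sample log-ratio penalty
    return -reward + beta * penalty

# Illustrative usage with made-up numbers.
loss = regularized_reward_loss(
    logp_policy=torch.tensor(-42.0, requires_grad=True),
    logp_ref=torch.tensor(-40.0),
    reward=torch.tensor(1.3),
)
loss.backward()  # gradient w.r.t. the policy log-probability
```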
