Direct Preference Optimization (DPO) Loss Function
The Direct Preference Optimization (DPO) method trains the target policy π_θ by minimizing a loss that operates directly on preference data: it computes the negative log-likelihood of the preference probabilities from the target and reference policies, skipping the intermediate reward model entirely. Over a preference dataset D of tuples (x, y_a, y_b), where y_a is the preferred and y_b the dispreferred response, the objective is formulated as:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_a,\, y_b) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_a \mid x)}{\pi_{\mathrm{ref}}(y_a \mid x)} - \beta \log \frac{\pi_\theta(y_b \mid x)}{\pi_{\mathrm{ref}}(y_b \mid x)}\right)\right]
$$

where σ is the logistic function, π_ref is a fixed reference policy, and β scales the policy-to-reference log-ratios. By minimizing this loss, the policy parameters θ are optimized so that preferred responses receive higher likelihood relative to the reference than dispreferred ones, aligning the model with human preferences.
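The following is a minimal PyTorch sketch of this loss, assuming the sequence-level log-probabilities under π_θ and π_ref have already been summed per response; the function name dpo_loss, the β value, and the toy numbers are illustrative assumptions, not part of the original note.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_a, policy_logp_b, ref_logp_a, ref_logp_b, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities.

    policy_logp_a / policy_logp_b: log pi_theta(y_a|x), log pi_theta(y_b|x)
    ref_logp_a    / ref_logp_b:    log pi_ref(y_a|x),   log pi_ref(y_b|x)
    """
    # Policy-to-reference log-ratios for the preferred and dispreferred responses.
    ratio_a = policy_logp_a - ref_logp_a
    ratio_b = policy_logp_b - ref_logp_b
    # Negative log of the preference probability sigma(beta * (ratio_a - ratio_b)).
    return -F.logsigmoid(beta * (ratio_a - ratio_b))

# Toy usage on a batch of two preference pairs (log-probabilities are placeholders).
policy_logp_a = torch.tensor([-12.0, -8.5])
policy_logp_b = torch.tensor([-13.5, -8.0])
ref_logp_a = torch.tensor([-12.5, -9.0])
ref_logp_b = torch.tensor([-13.0, -8.2])
print(dpo_loss(policy_logp_a, policy_logp_b, ref_logp_a, ref_logp_b).mean())
```

Only π_θ needs gradients here; the reference log-probabilities can be precomputed once over the preference dataset, which is why the method runs fully offline.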

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Elimination of the Reward Model in DPO
A key step in an alignment algorithm involves re-expressing the preference probability of a chosen response (y_a) over a rejected response (y_b) for a given input (x) purely in terms of policy ratios. The derivation is sketched below.
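A minimal sketch of the standard derivation, assuming a Bradley–Terry preference model and DPO's implicit reward r(x, y) = β log(π_θ(y|x)/π_ref(y|x)) + β log Z(x), where the partition function Z(x) depends only on the input x:

$$
\begin{aligned}
p(y_a \succ y_b \mid x) &= \sigma\big(r(x, y_a) - r(x, y_b)\big) \\
&= \sigma\!\left(\beta \log \frac{\pi_\theta(y_a \mid x)}{\pi_{\mathrm{ref}}(y_a \mid x)} + \beta \log Z(x) - \beta \log \frac{\pi_\theta(y_b \mid x)}{\pi_{\mathrm{ref}}(y_b \mid x)} - \beta \log Z(x)\right) \\
&= \sigma\!\left(\beta \log \frac{\pi_\theta(y_a \mid x)}{\pi_{\mathrm{ref}}(y_a \mid x)} - \beta \log \frac{\pi_\theta(y_b \mid x)}{\pi_{\mathrm{ref}}(y_b \mid x)}\right)
\end{aligned}
$$

Because the β log Z(x) terms cancel, the preference probability depends only on policy ratios, so no explicit reward model or per-prompt normalization constant ever has to be computed.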
Based on this mathematical simplification, what is the most significant practical consequence for the model training process?
Analysis of Normalization Factor Cancellation
The derivation of the preference probability in terms of policy ratios involves several key steps. Arrange the following mathematical expressions in the correct logical order to show how the initial preference model is transformed into the final expression used for optimization.
Selecting and Justifying DPO vs. RLHF for Preference Alignment Under Operational Constraints
Explaining DPO’s Objective as Offline RL Without a Reward Model: A Pipeline and Math-Based Justification
Diagnosing a “Missing Reward Model” DPO Implementation and Its Offline Implications
Post-Deployment Alignment Update: Choosing Between DPO and RLHF Under Logging and Compute Constraints
Interpreting DPO Preference Probabilities and Pipeline Implications from Logged Policy Ratios
Choosing an Alignment Pipeline and Debugging a DPO Objective Under Compute and Data Constraints
Learn After
A language model is being fine-tuned using a dataset of prompts (x), preferred responses (y_a), and dispreferred responses (y_b). The training objective is to minimize the following loss function:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_a,\, y_b)}\big[\log p_\theta(y_a \succ y_b \mid x)\big]
$$

In this framework, the probability that response y_a is preferred over y_b, denoted p_θ(y_a ≻ y_b | x), is computed directly from the likelihoods of each response under the current policy being trained and a fixed reference policy.
Based on this formulation, what is the most significant advantage of this training approach?
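As a hedged numeric illustration (the β and log-ratio values are assumptions chosen for the example, not taken from the item above): with β = 1, a policy-to-reference log-ratio of 0.3 for y_a and −0.2 for y_b, the implied preference probability is

$$
p_\theta(y_a \succ y_b \mid x) = \sigma\big(1 \cdot (0.3 - (-0.2))\big) = \sigma(0.5) \approx 0.62,
$$

so the model already slightly favors y_a, and minimizing the loss pushes this probability toward 1.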
Analysis of Policy Alignment with Preference Data
Consider a single data point (x, y_a, y_b) from a preference dataset, where y_a is the preferred response and y_b is the dispreferred response. In a training framework that directly optimizes a policy π_θ against a fixed reference policy π_ref by maximizing the log-probability of the preference data, if the policy π_θ currently assigns an equal likelihood to both responses (i.e., π_θ(y_a|x) = π_θ(y_b|x)), the loss contribution from this data point will be zero.
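A quick numeric check of this claim under the standard per-example DPO loss, assuming (for illustration only) that the reference policy also assigns equal likelihood to the two responses:

```python
import math

beta = 0.1
# Hypothetical values: the current policy assigns equal likelihood to both
# responses, and (for this illustration) so does the reference policy.
policy_logp_a = policy_logp_b = math.log(0.25)
ref_logp_a = ref_logp_b = math.log(0.25)

margin = beta * ((policy_logp_a - ref_logp_a) - (policy_logp_b - ref_logp_b))
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
print(round(loss, 3))  # 0.693, i.e. log(2) -- not zero
```

Equal policy likelihoods leave the margin at zero, so the per-example loss sits at log 2 ≈ 0.693; it approaches zero only as the preferred response's policy-to-reference log-ratio pulls well above the dispreferred one's.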
Comparison of DPO and RLHF Loss Functions