Elimination of the Reward Model in DPO
A key advantage of Direct Preference Optimization (DPO) is that it eliminates the explicit reward model from training. This is a direct consequence of re-expressing the preference probability in terms of policy ratios, a formulation in which the intractable normalization factor Z(x) cancels out. As a result, preference probabilities can be computed using only the policy being trained and a fixed reference policy, which removes reward-model training and inference from the alignment pipeline.
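As a concrete illustration, the following minimal PyTorch-style sketch computes the DPO preference probability and loss directly from per-example log-probabilities under the trained policy π_θ and the frozen reference policy π_ref. The function name, argument names, β value, and numeric log-probabilities are illustrative assumptions, not part of the original text; the point is that no reward-model output and no Z(x) term appears anywhere in the computation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Sketch of the DPO objective computed purely from policy log-probs.

    Each argument is a tensor of summed token log-probabilities log pi(y|x)
    for a batch of (prompt, response) pairs. No reward model score and no
    partition function Z(x) is needed.
    """
    # beta * [log pi_theta(y|x) - log pi_ref(y|x)] for chosen and rejected responses
    chosen_logratio = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_logratio = beta * (policy_logp_rejected - ref_logp_rejected)

    # Pr(chosen preferred over rejected | x) = sigmoid of the log-ratio difference
    pref_prob = torch.sigmoid(chosen_logratio - rejected_logratio)

    # DPO loss = negative log-likelihood of the observed preferences
    loss = -F.logsigmoid(chosen_logratio - rejected_logratio).mean()
    return loss, pref_prob

# Toy usage with made-up log-probabilities for a batch of two preference pairs
policy_chosen = torch.tensor([-12.3, -20.1])
policy_rejected = torch.tensor([-14.8, -19.5])
ref_chosen = torch.tensor([-13.0, -20.0])
ref_rejected = torch.tensor([-13.5, -19.8])
loss, prob = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
print(loss.item(), prob)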
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Elimination of the Reward Model in DPO
A key step in an alignment algorithm involves re-expressing the preference probability of a chosen response (y_a) over a rejected response (y_b) for a given input (x). The derivation is as follows:
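The steps below are a sketch of the standard DPO derivation, written in LaTeX with the y_a, y_b, x, β, π_θ, π_ref, and Z(x) notation used elsewhere on this page; the exact intermediate expressions on the original card may differ.

r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \quad \text{(implicit reward of the KL-regularized optimum)}

\Pr(y_a \succ y_b \mid x) = \sigma\big(r(x, y_a) - r(x, y_b)\big) \quad \text{(Bradley-Terry preference model)}

= \sigma\!\Big(\beta \log \frac{\pi_\theta(y_a \mid x)}{\pi_{\text{ref}}(y_a \mid x)} + \beta \log Z(x) - \beta \log \frac{\pi_\theta(y_b \mid x)}{\pi_{\text{ref}}(y_b \mid x)} - \beta \log Z(x)\Big)

= \sigma\!\Big(\beta \log \frac{\pi_\theta(y_a \mid x)}{\pi_{\text{ref}}(y_a \mid x)} - \beta \log \frac{\pi_\theta(y_b \mid x)}{\pi_{\text{ref}}(y_b \mid x)}\Big)

The two β log Z(x) terms cancel, leaving an expression that depends only on π_θ and π_ref.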
Based on this mathematical simplification, what is the most significant practical consequence for the model training process?
Analysis of Normalization Factor Cancellation
The derivation of the preference probability in terms of policy ratios involves several key steps. Arrange the following mathematical expressions in the correct logical order to show how the initial preference model is transformed into the final expression used for optimization.
Your team must choose an alignment approach for an...
Your team is implementing preference-based alignme...
Your team is reviewing two proposed alignment impl...
In a preference-based LLM alignment project, your ...
Selecting and Justifying DPO vs. RLHF for Preference Alignment Under Operational Constraints
Explaining DPO’s Objective as Offline RL Without a Reward Model: A Pipeline and Math-Based Justification
Diagnosing a “Missing Reward Model” DPO Implementation and Its Offline Implications
Post-Deployment Alignment Update: Choosing Between DPO and RLHF Under Logging and Compute Constraints
Interpreting DPO Preference Probabilities and Pipeline Implications from Logged Policy Ratios
Choosing an Alignment Pipeline and Debugging a DPO Objective Under Compute and Data Constraints
Direct Preference Optimization (DPO) Loss Function
Learn After
A language model alignment method re-expresses the probability of a preferred response (y_a) over a dispreferred response (y_b) for a given prompt (x) as follows:
Pr(y_a ≻ y_b | x) = Sigmoid( β log( π_θ(y_a|x) / π_ref(y_a|x) ) - β log( π_θ(y_b|x) / π_ref(y_b|x) ) )

where π_θ is the policy being trained and π_ref is a fixed reference policy. Based on this mathematical formulation, what is the primary reason this method can be trained without an explicit, separately-trained reward model?

Mechanism of Reward Model Elimination
An alignment algorithm calculates the probability of a preferred response y_a over a dispreferred response y_b for a given prompt x using the following expression:

Sigmoid( β log( π_θ(y_a|x) / π_ref(y_a|x) ) - β log( π_θ(y_b|x) / π_ref(y_b|x) ) )

Based on a direct analysis of this expression, which of the following components is not explicitly required to compute this probability during the training process?
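For reference, a minimal pure-Python evaluation of this expression makes the required inputs explicit: four policy log-probabilities and β, with no reward-model score and no partition function among them. The function name and the numeric log-probabilities below are made up for illustration.

import math

def preference_prob(logp_theta_a, logp_theta_b, logp_ref_a, logp_ref_b, beta=0.1):
    # beta * log(pi_theta/pi_ref) for each response, i.e. the implicit rewards
    margin = beta * ((logp_theta_a - logp_ref_a) - (logp_theta_b - logp_ref_b))
    # Sigmoid of the log-ratio difference gives Pr(y_a preferred over y_b | x)
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical summed log-probabilities for one (x, y_a, y_b) triple
print(preference_prob(-10.2, -12.7, -10.5, -11.9))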