Comparison of DPO and RLHF Loss Functions
The loss function used in Direct Preference Optimization (DPO) shares a structural similarity with the pairwise ranking loss used for training reward models in Reinforcement Learning from Human Feedback (RLHF): both minimize the negative log-probability of a preference under the Bradley-Terry model. The fundamental difference is the target of optimization: the DPO loss depends on and directly updates the parameters of the language model policy (π_θ), whereas the RLHF loss depends on and updates the parameters of a separate reward model (r_φ).
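To make the structural parallel concrete, here is a minimal Python sketch of both losses (the function names, the β hyperparameter, and the scalar log-likelihood inputs are illustrative, not taken from any particular library). Both are the same Bradley-Terry negative log-likelihood; they differ only in where the two scores come from and whose parameters receive the gradient:

import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dpo_loss(logp_theta_a, logp_theta_b, logp_ref_a, logp_ref_b, beta=0.1):
    # DPO: each response's implicit "reward" is the scaled log-ratio between
    # the policy being trained (pi_theta) and a frozen reference (pi_ref).
    # Minimizing this loss updates the policy parameters directly.
    margin = beta * ((logp_theta_a - logp_ref_a) - (logp_theta_b - logp_ref_b))
    return -log_sigmoid(margin)

def reward_model_loss(r_a, r_b):
    # RLHF reward modeling: the same Bradley-Terry negative log-likelihood,
    # but the scores come from a separate reward model r_phi, whose
    # parameters (not the policy's) are updated.
    return -log_sigmoid(r_a - r_b)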
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A language model is being fine-tuned using a dataset of prompts (x), preferred responses (y_a), and dispreferred responses (y_b). The training objective is to minimize the following loss function:

L(θ) = −E_{(x, y_a, y_b)~D} [ log σ( β log(π_θ(y_a|x) / π_ref(y_a|x)) − β log(π_θ(y_b|x) / π_ref(y_b|x)) ) ]

In this framework, the probability that response y_a is preferred over y_b, denoted as p(y_a ≻ y_b | x), is computed directly from the likelihoods of each response under the current policy being trained and a fixed reference policy.
Based on this formulation, what is the most significant advantage of this training approach?
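A minimal sketch of how that preference probability can be computed, assuming the standard DPO parameterization with scaling factor β (the item leaves the exact formula implicit; the numbers below are illustrative):

import math

def dpo_preference_prob(pi_theta_a, pi_theta_b, pi_ref_a, pi_ref_b, beta=0.1):
    # Bradley-Terry probability that y_a beats y_b, built only from the
    # likelihoods under the trained policy and the fixed reference policy.
    margin = beta * (math.log(pi_theta_a / pi_ref_a) - math.log(pi_theta_b / pi_ref_b))
    return 1.0 / (1.0 + math.exp(-margin))  # sigmoid of the margin

# If the policy has raised y_a's likelihood relative to the reference while
# leaving y_b unchanged, the model "believes" y_a is preferred:
print(dpo_preference_prob(0.02, 0.01, 0.01, 0.01, beta=1.0))  # sigmoid(ln 2) ≈ 0.667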
Analysis of Policy Alignment with Preference Data
Consider a single data point (x, y_a, y_b) from a preference dataset, where y_a is the preferred response and y_b is the dispreferred response. In a training framework that directly optimizes a policy π_θ against a fixed reference policy π_ref by maximizing the log-probability of the preference data, if the policy π_θ currently assigns an equal likelihood to both responses (i.e., π_θ(y_a|x) = π_θ(y_b|x)), the loss contribution from this data point will be zero.
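A quick numeric check of this claim, assuming the standard DPO objective and equal likelihoods under the reference policy as well (neither is pinned down by the item itself):

import math

# With pi_theta(y_a|x) = pi_theta(y_b|x) and (by assumption) equal reference
# likelihoods, the Bradley-Terry margin is 0, so the per-example loss is
# -log(sigmoid(0)) = log 2, which is not zero.
margin = 0.0
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
print(loss)  # 0.6931... = ln 2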