Comparison of DPO and RLHF Loss Functions

The loss function used in Direct Preference Optimization (DPO) is structurally similar to the pairwise ranking loss used to train reward models in Reinforcement Learning from Human Feedback (RLHF): both minimize the negative log-probability of an observed preference under the Bradley-Terry model. The fundamental difference is what they optimize: the DPO loss depends on and updates the parameters of the language model policy (θ) directly, whereas the RLHF ranking loss depends on and updates the parameters of a separate reward model (ϕ), which is only later used to train the policy via reinforcement learning.
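
For concreteness, the two losses can be written side by side in their standard formulations. Here $(x, y_w, y_l)$ is a prompt with a preferred and a dispreferred response drawn from the preference dataset $\mathcal{D}$, $\sigma$ is the logistic function, $\pi_{\text{ref}}$ is the frozen reference policy, and $\beta$ is a scaling hyperparameter:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]$$

Both losses are a negative log-sigmoid of a score margin; in DPO the role of the reward is played by the scaled log-ratio $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$. As a minimal sketch of this correspondence (the function names and PyTorch framing here are illustrative assumptions, not code from the source), both losses can be computed in a few lines once the relevant sequence log-probabilities or scalar rewards are available:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise DPO loss from summed sequence log-probabilities.

    Each argument is a tensor of per-example log-probs for the preferred (w)
    or dispreferred (l) response under the trainable policy pi_theta or the
    frozen reference policy pi_ref.
    """
    # Implicit reward margin: beta * (winner log-ratio - loser log-ratio)
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Negative log-likelihood of the Bradley-Terry preference probability
    return -F.logsigmoid(logits).mean()

def reward_model_loss(reward_w, reward_l):
    """Pairwise ranking loss for an RLHF reward model r_phi.

    reward_w / reward_l are scalar rewards r_phi(x, y_w) / r_phi(x, y_l).
    """
    return -F.logsigmoid(reward_w - reward_l).mean()

# Toy usage with fabricated log-probs for a batch of two preference pairs
lw = torch.tensor([-12.0, -9.5]); ll = torch.tensor([-14.0, -11.0])
rw = torch.tensor([-12.5, -10.0]); rl = torch.tensor([-13.5, -10.5])
print(dpo_loss(lw, ll, rw, rl))  # scalar loss, backpropagates into pi_theta
```

The identical `logsigmoid`-of-a-margin shape makes the structural similarity explicit, while the arguments make the difference explicit: `dpo_loss` consumes policy log-probabilities (gradients flow into θ), whereas `reward_model_loss` consumes reward-model scores (gradients flow into ϕ).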
