Comparison of DPO and RLHF Loss Functions
The loss function used in Direct Preference Optimization (DPO) shares a structural similarity with the pairwise ranking loss used for training reward models in Reinforcement Learning from Human Feedback (RLHF): both minimize the negative log-probability of a preference under the Bradley-Terry model. The fundamental difference is the target of optimization: the DPO loss depends on and directly updates the parameters of the language model policy (π_θ), whereas the RLHF loss depends on and updates the parameters of a separate reward model (r_φ).
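To make the structural parallel concrete, here is a minimal Python sketch of both losses (the function names, the β hyperparameter, and the scalar log-likelihood inputs are illustrative, not taken from any particular library). Both are the same Bradley-Terry negative log-likelihood; they differ only in where the two scores come from and whose parameters receive the gradient:

import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dpo_loss(logp_theta_a, logp_theta_b, logp_ref_a, logp_ref_b, beta=0.1):
    # DPO: each response's implicit "reward" is the scaled log-ratio between
    # the policy being trained (pi_theta) and a frozen reference (pi_ref).
    # Minimizing this loss updates the policy parameters directly.
    margin = beta * ((logp_theta_a - logp_ref_a) - (logp_theta_b - logp_ref_b))
    return -log_sigmoid(margin)

def reward_model_loss(r_a, r_b):
    # RLHF reward modeling: the same Bradley-Terry negative log-likelihood,
    # but the scores come from a separate reward model r_phi, whose
    # parameters (not the policy's) are updated.
    return -log_sigmoid(r_a - r_b)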
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A language model is being fine-tuned using a dataset of prompts (x), preferred responses (y_a), and dispreferred responses (y_b). The training objective is to minimize the following loss function:

L(θ) = −E_{(x, y_a, y_b)~D} [ log σ( β log(π_θ(y_a|x) / π_ref(y_a|x)) − β log(π_θ(y_b|x) / π_ref(y_b|x)) ) ]

In this framework, the probability that response y_a is preferred over y_b, denoted as p(y_a ≻ y_b | x), is computed directly from the likelihoods of each response under the current policy being trained and a fixed reference policy.
Based on this formulation, what is the most significant advantage of this training approach?
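A minimal sketch of how that preference probability can be computed, assuming the standard DPO parameterization with scaling factor β (the item leaves the exact formula implicit; the numbers below are illustrative):

import math

def dpo_preference_prob(pi_theta_a, pi_theta_b, pi_ref_a, pi_ref_b, beta=0.1):
    # Bradley-Terry probability that y_a beats y_b, built only from the
    # likelihoods under the trained policy and the fixed reference policy.
    margin = beta * (math.log(pi_theta_a / pi_ref_a) - math.log(pi_theta_b / pi_ref_b))
    return 1.0 / (1.0 + math.exp(-margin))  # sigmoid of the margin

# If the policy has raised y_a's likelihood relative to the reference while
# leaving y_b unchanged, the model "believes" y_a is preferred:
print(dpo_preference_prob(0.02, 0.01, 0.01, 0.01, beta=1.0))  # sigmoid(ln 2) ≈ 0.667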
Analysis of Policy Alignment with Preference Data
Consider a single data point (x, y_a, y_b) from a preference dataset, where y_a is the preferred response and y_b is the dispreferred response. In a training framework that directly optimizes a policy π_θ against a fixed reference policy π_ref by maximizing the log-probability of the preference data, if the policy π_θ currently assigns an equal likelihood to both responses (i.e., π_θ(y_a|x) = π_θ(y_b|x)), the loss contribution from this data point will be zero.
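A quick numeric check of this claim, assuming the standard DPO objective and equal likelihoods under the reference policy as well (neither is pinned down by the item itself):

import math

# With pi_theta(y_a|x) = pi_theta(y_b|x) and (by assumption) equal reference
# likelihoods, the Bradley-Terry margin is 0, so the per-example loss is
# -log(sigmoid(0)) = log 2, which is not zero.
margin = 0.0
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
print(loss)  # 0.6931... = ln 2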