Learn Before
Unnormalized Target Distribution in the DPO Objective
In the rearranged Direct Preference Optimization (DPO) objective function, the fixed term that does not depend on the target policy π_θ, specifically π_ref(y|x) * exp(r(x, y) / β), is interpreted as an unnormalized probability distribution over responses y. This conceptual shift is introduced because it is mathematically more intuitive to evaluate the objective function as a KL divergence between two valid probability distributions. To formally convert this unnormalized function into a normalized probability distribution, it must be divided by a normalization factor Z(x) = Σ_y π_ref(y|x) * exp(r(x, y) / β).
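To make the normalization concrete, here is a minimal Python sketch over a toy set of three candidate responses; all names and numbers (pi_ref, reward, beta) are hypothetical illustrations, not values from the card:

import math

beta = 1.0  # KL-regularization strength (hypothetical value)

# Toy reference-policy probabilities pi_ref(y|x) for three candidate responses
pi_ref = {"y1": 0.5, "y2": 0.3, "y3": 0.2}

# Toy reward-model scores r(x, y) for the same responses
reward = {"y1": 1.0, "y2": 2.0, "y3": 0.5}

# Fixed term of the rearranged objective: pi_ref(y|x) * exp(r(x, y) / beta).
# It is nonnegative but does not sum to 1, hence "unnormalized".
unnormalized = {y: pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref}

# Normalization factor Z(x) = sum over y of pi_ref(y|x) * exp(r(x, y) / beta)
Z = sum(unnormalized.values())

# Dividing by Z(x) yields a valid probability distribution over responses
pi_star = {y: value / Z for y, value in unnormalized.items()}

print(pi_star)
print(sum(pi_star.values()))  # 1.0 (up to floating-point error)

Note that as beta grows, exp(r(x, y) / beta) approaches 1 for every response, so the normalized target distribution collapses toward pi_ref; this is exactly the large-beta behavior probed by the first related question below.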
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Analyzing the Regularization Parameter in Policy Optimization
In the context of the training objective for a language model policy, consider the following formula, where the goal is to find the optimal parameters by minimizing the expected value:

min E[-reward + β * (log_prob_policy - log_prob_reference)]

If the hyperparameter β is set to an extremely large positive value, what is the most likely outcome for the optimized policy?

A language model is being trained to minimize the following objective function:

Objective = E[-reward(x, y) + β * (log π_θ(y|x) - log π_θ_ref(y|x))]

During one training step, the current policy π_θ generates a response y that is highly creative and receives a very high reward(x, y). However, this response is stylistically very different from the typical outputs of the reference policy π_θ_ref, resulting in a very low probability π_θ_ref(y|x). Assuming β is a positive constant, how does this specific generation (x, y) influence the two main components of the objective function for this step?

Rearrangement of the Assumed DPO Objective
Unnormalized Target Distribution in the DPO Objective
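Both related questions above evaluate the same per-sample quantity, -reward(x, y) + β * (log π_θ(y|x) - log π_θ_ref(y|x)). The Python sketch below, with entirely hypothetical numbers, computes its two components for the scenario in the second question (high reward, response very unlikely under the reference policy) and sweeps beta to show the trade-off probed by the first:

import math

def objective_terms(reward, p_policy, p_ref, beta):
    """Return (-reward, KL penalty) for one sampled response."""
    reward_term = -reward
    kl_term = beta * (math.log(p_policy) - math.log(p_ref))
    return reward_term, kl_term

# Hypothetical per-sample values: a creative response with a high reward
# that is very unlikely under the reference policy.
reward_xy = 8.0    # reward(x, y): high
p_policy = 0.20    # pi_theta(y|x)
p_ref = 1e-4       # pi_theta_ref(y|x): very low

for beta in (0.1, 1.0, 100.0):
    r_term, kl_term = objective_terms(reward_xy, p_policy, p_ref, beta)
    total = r_term + kl_term
    print(f"beta={beta:>6}: reward term={r_term:+.2f}, "
          f"KL penalty={kl_term:+.2f}, total={total:+.2f}")

For small beta, the large reward dominates and this sample lowers the objective. Because log π_θ(y|x) - log π_θ_ref(y|x) is large and positive here (about +7.6), the KL penalty grows linearly in beta, and for very large beta it overwhelms the reward term, pushing the optimized policy back toward the reference policy.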