Rearrangement of the Assumed DPO Objective
To isolate the target policy π_θ, the assumed Direct Preference Optimization (DPO) objective function is mathematically rearranged. By manipulating the formula, the target policy term can be separated from the fixed reference terms. Dividing the objective E[-reward(x, y) + β · (log π_θ(y|x) - log π_ref(y|x))] by the positive constant β transforms it into the expected difference between the log-probability of the target policy and a fixed function of the reward and the reference policy:
min E[ log π_θ(y|x) - (log π_ref(y|x) + (1/β) · reward(x, y)) ]
This formulation expresses the objective as a difference involving log-probability functions, paving the way for it to be interpreted as a divergence between distributions.
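The algebraic equivalence behind this rearrangement can be checked numerically. The sketch below uses arbitrary per-sample values (hypothetical stand-ins for the reward and the two log-probabilities) to verify that the original objective E[-reward + β · (log π_θ - log π_ref)] equals β times the rearranged expectation of log π_θ minus the fixed term log π_ref + reward/β:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.5
n = 1000

# Hypothetical per-sample quantities standing in for the real model outputs.
reward = rng.normal(size=n)       # reward(x, y)
logp = rng.normal(size=n)         # log pi_theta(y|x)
logp_ref = rng.normal(size=n)     # log pi_ref(y|x)

# Original form of the objective.
orig = np.mean(-reward + beta * (logp - logp_ref))

# Rearranged form: beta * E[log pi_theta - (log pi_ref + reward / beta)].
rearr = beta * np.mean(logp - (logp_ref + reward / beta))

assert np.allclose(orig, rearr)
```

Since β is a positive constant, dividing by it leaves the minimizer unchanged, which is why the two forms define the same optimal policy.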
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Analyzing the Regularization Parameter in Policy Optimization
In the context of the training objective for a language model policy, consider the following formula, where the goal is to find the optimal parameters by minimizing the expected value:
min E[-reward + β * (log_prob_policy - log_prob_reference)]
If the hyperparameter β is set to an extremely large positive value, what is the most likely outcome for the optimized policy?

A language model is being trained to minimize the following objective function:
Objective = E[-reward(x, y) + β * (log π_θ(y|x) - log π_θ_ref(y|x))]
During one training step, the current policy π_θ generates a response y that is highly creative and receives a very high reward(x, y). However, this response is stylistically very different from the typical outputs of the reference policy π_θ_ref, resulting in a very low probability π_θ_ref(y|x). Assuming β is a positive constant, how does this specific generation (x, y) influence the two main components of the objective function for this step?

Rearrangement of the Assumed DPO Objective
Unnormalized Target Distribution in the DPO Objective
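The role of β in this family of objectives can be illustrated with a toy numeric check. The sketch below (with a hypothetical two-outcome distribution and made-up rewards) evaluates objective(π) = E_{y~π}[-reward(y)] + β · KL(π ‖ π_ref) for a high-reward policy that drifts from the reference and a lower-reward policy that matches it; which one scores better flips as β grows:

```python
import numpy as np

# Hypothetical two-outcome toy problem.
p_ref = np.array([0.5, 0.5])      # reference policy
p_drift = np.array([0.95, 0.05])  # high reward, far from the reference
p_match = np.array([0.5, 0.5])    # lower reward, identical to the reference
reward = np.array([1.0, 0.0])     # reward favors outcome 0

def objective(p, beta):
    """Expected negative reward plus beta-weighted KL(p || p_ref)."""
    return -(p @ reward) + beta * np.sum(p * np.log(p / p_ref))

# Small beta: the reward term dominates, so the drifting policy wins.
assert objective(p_drift, 0.1) < objective(p_match, 0.1)

# Large beta: the KL penalty dominates, so staying near the reference wins.
assert objective(p_match, 100.0) < objective(p_drift, 100.0)
```

This matches the intuition probed by the questions above: as β grows, the optimized policy is pulled toward the reference distribution regardless of reward.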