Learn Before
In the context of the training objective for a language model policy, consider the following objective, where the goal is to find the optimal policy parameters by minimizing the expected value:
min E[-reward + β * (log_prob_policy - log_prob_reference)]
If the hyperparameter β is set to an extremely large positive value, what is the most likely outcome for the optimized policy?
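The effect of a very large β can be checked numerically. The sketch below uses invented scalar probabilities and rewards (none of these numbers come from the card) to compare two candidate responses: one with high reward but low probability under the reference policy, and one with modest reward that matches the reference. As β grows, the KL-style penalty dominates and the reference-matching response minimizes the objective.

```python
import math

def objective(reward, p_policy, p_ref, beta):
    # Per-sample objective: -reward + beta * (log p_policy - log p_ref).
    return -reward + beta * (math.log(p_policy) - math.log(p_ref))

# Candidate A: high reward, but very unlikely under the reference policy.
# Candidate B: modest reward, identical probability under both policies.
# All numbers are illustrative assumptions.
def best_candidate(beta):
    a = objective(reward=10.0, p_policy=0.9, p_ref=0.01, beta=beta)
    b = objective(reward=2.0, p_policy=0.5, p_ref=0.5, beta=beta)
    return "A" if a < b else "B"

print(best_candidate(0.1))    # small beta: reward dominates -> "A"
print(best_candidate(100.0))  # large beta: penalty dominates -> "B"
```

With a very large β, any deviation from the reference policy is penalized so heavily that the optimized policy effectively collapses onto the reference, ignoring the reward signal.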
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing the Regularization Parameter in Policy Optimization
A language model is being trained to minimize the following objective function:

Objective = E[-reward(x, y) + β * (log π_θ(y|x) - log π_θ_ref(y|x))]

During one training step, the current policy π_θ generates a response y that is highly creative and receives a very high reward(x, y). However, this response is stylistically very different from the typical outputs of the reference policy π_θ_ref, resulting in a very low probability π_θ_ref(y|x). Assuming β is a positive constant, how does this specific generation (x, y) influence the two main components of the objective function for this step?

Rearrangement of the Assumed DPO Objective
Unnormalized Target Distribution in the DPO Objective
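The related scenario above (a creative, high-reward response that the reference policy finds very unlikely) can be made concrete with a small numeric sketch. All values below are invented for illustration; the point is only the sign and relative size of the two components.

```python
import math

# Assumed values for one generation (x, y): high reward, low reference probability.
reward = 50.0    # very high reward(x, y)
p_policy = 0.2   # pi_theta(y|x)
p_ref = 1e-6     # pi_theta_ref(y|x): the reference finds y very unlikely
beta = 1.0

# Component 1: the reward term decreases the objective (encourages y).
reward_term = -reward

# Component 2: the regularization term is large and positive,
# because log(p_policy) - log(p_ref) = log(p_policy / p_ref) >> 0.
penalty_term = beta * (math.log(p_policy) - math.log(p_ref))

print(reward_term, penalty_term)
```

So the high reward pulls the objective down while the divergence from the reference pushes it up; the two components directly oppose each other for this generation.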