Learn Before
Analyzing the Regularization Parameter in Policy Optimization
A language model's policy, π_θ, is being optimized by minimizing the objective function below. In this function, reward(x, y) represents the reward for generating output y from input x, and π_ref is a fixed reference policy.

min_θ E[-reward(x, y) + β * (log π_θ(y|x) - log π_ref(y|x))]
Analyze the trade-offs involved when setting the hyperparameter β to a very high value versus a very low (but non-zero) value. Describe the likely characteristics of the resulting model's behavior in each scenario.
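The objective in the question trades reward against a KL-style penalty that keeps the policy close to the reference. A minimal numeric sketch, using entirely hypothetical reward and log-probability values, shows how the choice of β flips which candidate generation the objective prefers:

```python
def objective(reward, logp_policy, logp_ref, beta):
    """Per-sample loss: -reward + beta * (log-ratio to the reference policy)."""
    return -reward + beta * (logp_policy - logp_ref)

# Two hypothetical candidate generations:
# A: high reward but far from the reference policy (large log-ratio)
# B: modest reward but close to the reference policy (log-ratio near zero)
cand_a = dict(reward=5.0, logp_policy=-2.0, logp_ref=-10.0)  # log-ratio = +8.0
cand_b = dict(reward=2.0, logp_policy=-3.0, logp_ref=-3.2)   # log-ratio = +0.2

for beta in (0.01, 10.0):
    loss_a = objective(beta=beta, **cand_a)
    loss_b = objective(beta=beta, **cand_b)
    preferred = "A (reward-seeking)" if loss_a < loss_b else "B (stays near reference)"
    print(f"beta={beta:>5}: loss_A={loss_a:.2f}, loss_B={loss_b:.2f} -> prefers {preferred}")
```

With a tiny β the reward term dominates and the off-distribution candidate A wins; with a large β the regularizer dominates and the objective favors candidate B, which barely deviates from the reference policy.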
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing the Regularization Parameter in Policy Optimization
In the context of the training objective for a language model policy, consider the following formula where the goal is to find the optimal parameters by minimizing the expected value:
min E[-reward + β * (log_prob_policy - log_prob_reference)]

If the hyperparameter β is set to an extremely large positive value, what is the most likely outcome for the optimized policy?

A language model is being trained to minimize the following objective function:

Objective = E[-reward(x, y) + β * (log π_θ(y|x) - log π_θ_ref(y|x))]

During one training step, the current policy π_θ generates a response y that is highly creative and receives a very high reward(x, y). However, this response is stylistically very different from the typical outputs of the reference policy π_θ_ref, resulting in a very low probability π_θ_ref(y|x). Assuming β is a positive constant, how does this specific generation (x, y) influence the two main components of the objective function for this step?

Rearrangement of the Assumed DPO Objective
Unnormalized Target Distribution in the DPO Objective
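One of the related questions above asks how a high-reward but off-distribution generation influences the two components of the objective. A short numeric sketch, with hypothetical probabilities and a hypothetical β, makes the opposing pulls concrete:

```python
import math

# Hypothetical numbers for the creative-but-off-distribution sample (x, y)
reward = 9.0      # very high reward(x, y)
p_policy = 0.05   # current policy π_θ assigns moderate probability to y
p_ref = 1e-6      # reference policy π_θ_ref finds y very unlikely
beta = 0.1

# Reward term: large negative contribution to the loss (encourages y)
reward_term = -reward

# Regularizer term: beta * (log π_θ(y|x) - log π_θ_ref(y|x)).
# Because p_ref is tiny, the log-ratio is large and positive (penalizes y).
reg_term = beta * (math.log(p_policy) - math.log(p_ref))

print(f"reward term      = {reward_term:.3f}")
print(f"regularizer term = {reg_term:.3f}")
```

The two terms pull in opposite directions: the reward term lowers the loss for the creative sample, while the KL-style regularizer raises it in proportion to the divergence from the reference policy, with β setting the balance.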