Learn Before
Stabilizing Policy Optimization Training
Given the following case study, identify the likely cause of the training instability and propose a specific adjustment to the objective function to resolve it. Explain your reasoning.
0
1
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a policy optimization method, a composite objective function is used, defined as
Objective = Clipped_Surrogate_Objective - β * Divergence_Penalty. This function balances maximizing the primary objective with a penalty for how much the policy changes. What is the most likely consequence of setting the hyperparameterβto a very high value?Stabilizing Policy Optimization Training
Analyzing the Trade-off in a Policy Optimization Objective