Learn Before
Analyzing the Trade-off in Policy Optimization
Consider the objective function used for policy optimization in reinforcement learning from human feedback (RLHF). This objective combines a term that maximizes reward with a penalty term, scaled by a coefficient β, that regulates how far the policy may diverge from a reference policy. Analyze the distinct consequences for the language model's generated outputs if β is set to a very large value versus if it is set to zero. Explain the reasoning behind each outcome.
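For reference, the objective in question is typically written as follows. This is a standard KL-regularized form; the exact notation used in the chapter (for example the symbols r and π_ref) may differ:

```latex
% KL-regularized policy-optimization objective (a standard form;
% the chapter's notation may differ slightly):
\[
  J(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \bigl[ r(x, y) \bigr]
  - \beta \, D_{\mathrm{KL}}\!\bigl( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)
\]
```

Here r(x, y) is the reward model's score for response y to prompt x, π_θ is the policy being trained, and π_ref is the frozen reference policy (typically the supervised fine-tuned model).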
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
PPO Objective for LLM Training
Derivation of the KL Divergence Objective for Policy Optimization
During the policy optimization stage of training a large language model, an engineer observes that the model's outputs are coherent and safe but show very little improvement over the initial supervised fine-tuned version, consistently receiving mediocre scores from the reward model. Which of the following is the most likely cause of this issue, based on the policy optimization objective that balances maximizing rewards against a penalty for policy divergence? (A numeric sketch of this trade-off follows the list below.)
Analyzing the Trade-off in Policy Optimization
Analyzing a Modified Policy Optimization Objective
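As a companion to the two β-related questions above, here is a minimal, self-contained numeric sketch. The function name penalized_reward and all numbers are hypothetical choices for illustration, not from the course material:

```python
# Minimal sketch of the KL-regularized reward used in RLHF-style
# policy optimization: objective = r(x, y) - beta * KL(pi_theta || pi_ref).
# The numbers below are hypothetical, chosen only to show how beta
# shifts the balance between the two terms.

def penalized_reward(reward: float, kl: float, beta: float) -> float:
    """KL-regularized objective for a single sampled response."""
    return reward - beta * kl

# A response that scores well with the reward model but has drifted
# far from the reference policy (high KL):
reward, kl = 2.0, 1.5

for beta in (0.0, 0.1, 100.0):
    print(f"beta={beta:6.1f} -> objective = {penalized_reward(reward, kl, beta):8.2f}")

# beta = 0.0:   the KL term vanishes, so only the reward matters and the
#               optimizer is free to exploit the reward model (reward hacking).
# beta = 100.0: the penalty dwarfs any achievable reward, so the best
#               strategy is to keep KL near zero, i.e. stay at pi_ref,
#               which matches the "coherent but mediocre" symptom in the
#               troubleshooting question above.
```

Running the loop makes the trade-off concrete: at β = 0 the objective equals the raw reward, while at β = 100 the KL penalty overwhelms it, so the optimizer's best move is to not diverge from the reference policy at all.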