In a reinforcement learning context, a policy is updated by maximizing an objective function. Consider an objective function that incorporates two distinct mechanisms to control the size of policy updates relative to a reference policy:
- A 'clipping' mechanism that imposes a hard, per-sample limit on the probability ratio between the new and reference policies, so that the objective stops increasing once the ratio leaves a fixed interval around 1.
- A 'penalty' term that is subtracted from the objective, whose magnitude grows as the new policy diverges from the reference policy, measured across all samples.
What is the most accurate analytical reason for using both of these mechanisms together, rather than relying on just one?
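The two mechanisms described above can be sketched as a single objective. This is a minimal, hypothetical NumPy sketch (names like `composite_objective`, `clip_eps`, and `kl_coef` are illustrative, not from the source); the penalty uses a non-negative sample-based KL estimator, which is one common choice, not the only one:

```python
import numpy as np

def composite_objective(logp_new, logp_ref, advantages,
                        clip_eps=0.2, kl_coef=0.1):
    """Clipped surrogate combined with a KL-style penalty (sketch)."""
    ratio = np.exp(logp_new - logp_ref)          # probability ratio r
    # Mechanism 1 (clipping): hard per-sample bound on the ratio's effect,
    # so the objective stops increasing once r leaves [1-eps, 1+eps].
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    # Mechanism 2 (penalty): non-negative sample-based KL estimate
    # (r - 1) - log r, averaged across all samples.
    kl_estimate = np.mean((ratio - 1.0) - np.log(ratio))
    return np.mean(surrogate) - kl_coef * kl_estimate
```

Note how the clip term acts per sample while the penalty term aggregates divergence over the whole batch, which is one way to see why the two constraints are not redundant.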
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Proximal Policy Optimization (PPO)
Diagnosing Training Instability in Reinforcement Learning
Complementary Roles of Policy Update Constraints
Composite Objective for PPO-Clip