A reinforcement learning agent is being trained using a utility function that incorporates an upper-bound clip on the policy probability ratio, defined as min(ratio, 1 + ε), where ε is a small positive constant. Consider two distinct actions taken during an episode:

- Action A: Has a large positive advantage, and its probability ratio is 2.0.
- Action B: Has a large negative advantage, and its probability ratio is 0.1.

Assuming ε = 0.2, how does this specific clipping mechanism influence the policy update derived from these two actions?
Tags
Ch.4 Alignment - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Related
Incorporating Policy Divergence Penalty into the Clipped Surrogate Objective
PPO Clipped Objective for Language Models
Stabilizing Policy Gradient Training
Answer
A utility function that modifies the policy probability ratio r_t using the operation min(r_t, 1 + ε) is primarily intended to mitigate training instability caused by over-aggressive updates toward actions that are discovered to be substantially better than under the reference policy (i.e., actions with a large positive advantage whose probability the new policy has already increased). Once r_t exceeds 1 + ε, the clipped term becomes a constant, its gradient vanishes, and the update can no longer push that action's probability higher.

With ε = 0.2, the bound is 1.2. Action A's ratio of 2.0 exceeds the bound, so it is clipped to 1.2: its contribution to the objective is capped and its gradient is zeroed, halting any further increase in Action A's probability from this sample. Action B's ratio of 0.1 lies below the bound, so min(0.1, 1.2) = 0.1 leaves it untouched: the full gradient flows, and the update continues to reduce Action B's probability, as its large negative advantage dictates.
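To make the asymmetry concrete, here is a minimal PyTorch sketch of the one-sided clip described above. The advantage values of +3.0 and -3.0 are illustrative assumptions, since the question only specifies "large positive" and "large negative".

```python
import torch

EPS = 0.2  # clipping constant from the question

def upper_clipped_objective(ratio: torch.Tensor, advantage: float) -> torch.Tensor:
    # One-sided upper-bound clip from the question: min(ratio, 1 + eps) * advantage
    return torch.clamp(ratio, max=1.0 + EPS) * advantage

# Advantages of +/-3.0 are assumed for illustration only.
cases = [("Action A", 2.0, +3.0), ("Action B", 0.1, -3.0)]

for name, r, adv in cases:
    ratio = torch.tensor(r, requires_grad=True)
    obj = upper_clipped_objective(ratio, adv)
    obj.backward()  # gradient of the objective with respect to the ratio
    print(f"{name}: objective = {obj.item():+.2f}, d(obj)/d(ratio) = {ratio.grad.item():+.2f}")

# Expected output:
#   Action A: objective = +3.60, d(obj)/d(ratio) = +0.00  (ratio clipped at 1.2 -> gradient blocked)
#   Action B: objective = -0.30, d(obj)/d(ratio) = -3.00  (ratio unclipped -> full gradient flows)
```

For contrast, the full PPO objective takes min(r_t · A, clip(r_t, 1 − ε, 1 + ε) · A), so the penalty for negative-advantage actions with inflated ratios is never capped; the one-sided min(r_t, 1 + ε) used here corresponds to the positive-advantage branch of that objective.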