Incorporating Policy Divergence Penalty into the Clipped Surrogate Objective
The policy divergence penalty can be integrated into the clipped surrogate objective to form a composite objective. The penalty encourages the current policy to remain close to the reference policy, limiting large updates that could destabilize learning. The combined objective therefore constrains policy updates in two ways: by clipping the probability ratio and by penalizing divergence from the reference policy.
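As a concrete illustration, here is a minimal sketch of such a composite objective in Python (PyTorch). The function name composite_ppo_objective, the hyperparameters eps and beta, and the particular per-sample divergence estimate are assumptions made for this example, not a prescribed implementation.

```python
import torch

def composite_ppo_objective(logp_new, logp_ref, advantage, eps=0.2, beta=0.01):
    """Clipped surrogate objective with a subtracted policy-divergence penalty (sketch)."""
    # Probability ratio r_t = pi_new(a|s) / pi_ref(a|s).
    ratio = torch.exp(logp_new - logp_ref)

    # Clipped surrogate term: for each sample, the objective stops improving
    # once the ratio leaves the interval [1 - eps, 1 + eps].
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    surrogate = torch.min(unclipped, clipped).mean()

    # Divergence penalty: a simple per-sample KL estimate, (r - 1) - log r,
    # which is zero when the policies agree and grows as they drift apart.
    kl_penalty = ((ratio - 1.0) - torch.log(ratio)).mean()

    # Composite objective to be maximized: clipping limits per-sample updates,
    # while the penalty discourages overall divergence from the reference policy.
    return surrogate - beta * kl_penalty
```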

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
PPO Clipped Objective for Language Models
A reinforcement learning agent is being trained using a utility function that incorporates an upper-bound clip on the policy probability ratio, defined as min(ratio, 1 + ε), where ε is a small positive constant. Consider two distinct actions taken during an episode:
- Action A: has a large positive advantage, and its probability ratio is 2.0.
- Action B: has a large negative advantage, and its probability ratio is 0.1.

Assuming ε = 0.2, how does this specific clipping mechanism influence the policy update derived from these two actions?
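For concreteness, a small sketch of the arithmetic described in the question (the loop and variable names are purely illustrative):

```python
eps = 0.2

# Upper-bound clip from the question: min(ratio, 1 + eps).
for name, ratio in [("A", 2.0), ("B", 0.1)]:
    clipped = min(ratio, 1.0 + eps)
    print(f"Action {name}: ratio {ratio} -> min(ratio, 1 + eps) = {clipped}")
# Action A: 2.0 is clipped down to 1.2; Action B: 0.1 is left unchanged.
```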
A utility function that modifies the policy probability ratio r_t using the operation min(r_t, 1 + ε) is primarily intended to mitigate training instability caused by actions that are discovered to be substantially worse than the reference policy's actions (i.e., actions with a large negative advantage).

Stabilizing Policy Gradient Training
Learn After
Proximal Policy Optimization (PPO)
In a reinforcement learning context, a policy is updated by maximizing an objective function. Consider an objective function that incorporates two distinct mechanisms to control the size of policy updates relative to a reference policy:
- A 'clipping' mechanism that puts a hard limit on the probability ratio between the new and reference policies, effectively creating a boundary beyond which the objective does not increase for a given sample.
- A 'penalty' term that is subtracted from the objective, with its magnitude increasing as the new policy diverges from the reference policy across all samples.
What is the most accurate analytical reason for using both of these mechanisms together, rather than relying on just one?
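To make the contrast concrete, the sketch below tabulates how each mechanism responds as the probability ratio grows; the per-sample KL estimate (r - 1) - log r and the coefficient beta are assumptions chosen for illustration.

```python
import math

eps, beta, advantage = 0.2, 0.1, 1.0

for ratio in [1.0, 1.2, 2.0, 4.0]:
    # Clipped term: saturates at (1 + eps) * advantage, so beyond the boundary it
    # offers no further incentive but also exerts no pull back toward the reference.
    clipped_term = min(ratio, 1.0 + eps) * advantage

    # Divergence penalty: zero when the policies agree (ratio = 1) and growing
    # as the policy drifts further, so it keeps pulling toward the reference.
    penalty = beta * ((ratio - 1.0) - math.log(ratio))

    print(f"ratio={ratio:.1f}  clipped term={clipped_term:.2f}  penalty={penalty:.3f}")
```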
Diagnosing Training Instability in Reinforcement Learning
Complementary Roles of Policy Update Constraints
Composite Objective for PPO-Clip