Learn Before
Stabilizing Policy Updates with a Divergence Penalty
Incorporating a policy divergence penalty into the optimization objective stabilizes the learning process: the penalty discourages the current policy from straying too far from a reference policy, limiting the excessively large updates that can disrupt training.
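A toy one-dimensional sketch of this idea (all objective shapes, the squared-distance divergence, and the value of `beta` are illustrative assumptions, not from the source): the raw objective prefers a large jump in the parameter, but the penalty pulls the maximizer back toward the reference point.

```python
import numpy as np

theta_old = 0.0                       # reference (previous) parameters
thetas = np.linspace(-2.0, 2.0, 401)  # candidate parameter values

def J(theta):
    # Hypothetical raw objective: peaks at theta = 1.5, far from theta_old.
    return -(theta - 1.5) ** 2

def D(theta, theta_ref):
    # Simple stand-in divergence: squared distance from the reference.
    return (theta - theta_ref) ** 2

beta = 2.0  # penalty weight (assumed value)

best_raw = thetas[np.argmax(J(thetas))]
best_penalized = thetas[np.argmax(J(thetas) - beta * D(thetas, theta_old))]
# best_penalized lies between theta_old and best_raw: the penalty
# shrinks the update without changing its direction.
```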
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Log-Probability Difference as a Policy Divergence Penalty
An engineer is training a policy model and wants to prevent large, destabilizing updates between training iterations. They modify their original objective function,
J(θ), to a new objective function, J_new(θ) = J(θ) - β * D(θ, θ_old), where θ represents the current policy parameters, θ_old represents the parameters from the previous iteration, D is a function that measures the divergence between the two sets of parameters (a larger value means more divergence), and β is a positive coefficient. During optimization, the goal is to maximize J_new(θ). What is the primary effect of the - β * D(θ, θ_old) term on the training process?
Stabilizing Reinforcement Learning Training
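The objective in this question, together with the log-probability-difference penalty named above, can be sketched as follows. The function names, the Monte Carlo KL estimate, and `beta=0.05` are assumptions for illustration, not the engineer's actual implementation.

```python
import numpy as np

def kl_penalty_estimate(logp_current, logp_old):
    # With actions sampled from the current policy, the mean log-probability
    # difference is a Monte Carlo estimate of KL(current || old): it is zero
    # when the policies agree and grows as they diverge.
    return float(np.mean(logp_current - logp_old))

def j_new(j_theta, logp_current, logp_old, beta=0.05):
    # J_new(theta) = J(theta) - beta * D(theta, theta_old).
    # Maximizing J_new trades raw objective value against divergence,
    # so aggressive policy changes are penalized.
    return j_theta - beta * kl_penalty_estimate(logp_current, logp_old)
```

When the current and old policies assign identical log-probabilities, the penalty vanishes and J_new equals J; as the current policy drifts, the penalty reduces J_new, discouraging the update.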
Choosing an Objective Function for Stable Policy Updates
Stabilizing Policy Updates with a Divergence Penalty
When implementing a penalty-based trust region for policy optimization where the goal is to maximize the objective function, increasing the weight of the penalty term will shrink the trusted area, restricting the policy to smaller updates.
Learn After
An engineer is fine-tuning a language model and observes that the training process is highly unstable. The model's performance fluctuates wildly, and the training loss sometimes spikes dramatically, suggesting the policy updates are too aggressive. Which of the following modifications to the optimization objective is most specifically designed to counteract this problem by directly constraining the magnitude of policy changes at each step?
Stabilizing an Erratic Training Process
Analyzing the Impact of a Policy Divergence Penalty