1Cademy - Composite Objective for PPO-Clip

Learn Before

Incorporating Policy Divergence Penalty into the Clipped Surrogate Objective
Proximal Policy Optimization (PPO)

Formula

Composite Objective for PPO-Clip

PPO-Clip training uses a composite objective that combines the clipped surrogate objective ( $U_{\text{clip}}$ ) with a policy divergence penalty:

$$U_{\text{ppo-clip}}(\tau; \theta) = U_{\text{clip}}(\tau; \theta) - \beta \cdot \text{Penalty}$$

Here the hyperparameter $\beta$ is the weight on the penalty term. Because the penalty is subtracted, a larger $\beta$ discourages policy divergence more strongly, and $\beta = 0$ recovers the plain clipped surrogate objective.
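The composite objective can be illustrated with a small numerical sketch. The card does not specify the form of $U_{\text{clip}}$ or of the penalty, so the code below makes two assumptions: $U_{\text{clip}}$ is the usual mean of $\min(r A, \text{clip}(r, 1-\epsilon, 1+\epsilon) A)$ over sampled steps, and the penalty is a simple sample estimate of the KL divergence between the old and new policies. The function names, the `eps` and `beta` defaults, and the input layout are all hypothetical choices for illustration.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Assumed form of U_clip: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                       # r = pi_new / pi_old
        clipped = max(1.0 - eps, min(1.0 + eps, ratio)) # clip(r, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def kl_penalty(logp_new, logp_old):
    """Assumed penalty: sample estimate of KL(old || new), using actions
    drawn from the old policy (mean of logp_old - logp_new)."""
    diffs = [lo - ln for ln, lo in zip(logp_new, logp_old)]
    return sum(diffs) / len(diffs)

def ppo_clip_objective(logp_new, logp_old, advantages, beta=0.01, eps=0.2):
    """U_ppo-clip = U_clip - beta * Penalty (to be maximized)."""
    u_clip = clipped_surrogate(logp_new, logp_old, advantages, eps)
    return u_clip - beta * kl_penalty(logp_new, logp_old)

# Hypothetical per-step log-probabilities and advantage estimates.
logp_old = [-1.0, -0.7, -2.3]
logp_new = [-1.1, -0.6, -2.0]
advantages = [0.5, -0.2, 1.0]

for beta in (0.0, 0.1, 1.0):
    value = ppo_clip_objective(logp_new, logp_old, advantages, beta=beta)
    print(f"beta={beta}: objective={value:.4f}")
```

With `beta=0.0` the result equals the clipped surrogate alone; increasing `beta` shifts the objective by the weighted penalty. In practice these quantities would be computed on tensors in an autodiff framework so that the objective can be differentiated with respect to $\theta$.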