Overall PPO Objective Function for Language Models
The overall objective function for training language models with Proximal Policy Optimization (PPO), denoted $\mathcal{L}_{\text{PPO}}(\theta)$, combines the clipped surrogate objective with a policy divergence penalty. This composite objective is formulated as:

$$\mathcal{L}_{\text{PPO}}(\theta) = \mathcal{L}_{\text{clip}}(\theta) - \beta \cdot \text{Penalty}(\theta)$$

In this equation, $\mathcal{L}_{\text{clip}}(\theta)$ represents the PPO clipped objective, while the $\text{Penalty}(\theta)$ term quantifies the divergence of the current policy $\pi_{\theta}$ from a reference policy $\pi_{\text{ref}}$ (for example, as the sum over generated tokens of the log-probability differences $\log \pi_{\theta} - \log \pi_{\text{ref}}$). The hyperparameter $\beta$ serves as a coefficient to control the magnitude of this penalty.
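As a concrete illustration, here is a minimal sketch of how this combined objective could be computed for one sampled response. It assumes PyTorch tensors of per-token log-probabilities and advantages; the function name, argument names, and the default values for the clipping range and β are illustrative choices, not taken from the course material.

```python
import torch

def ppo_overall_objective(logp_new, logp_old, logp_ref, advantages,
                          clip_eps=0.2, beta=0.1):
    """Minimal sketch of the combined PPO objective for one sampled response.

    logp_new:   log-probs of the generated tokens under the policy being updated
    logp_old:   log-probs under the policy that sampled the response (held fixed)
    logp_ref:   log-probs under the frozen reference policy
    advantages: per-token advantage estimates
    All tensors have shape [num_tokens].
    """
    # Clipped surrogate objective: ratio of new to old policy, clipped to
    # [1 - eps, 1 + eps] so a single update cannot move the policy too far.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    l_clip = torch.min(ratio * advantages, clipped * advantages).sum()

    # Policy divergence penalty: summed per-token log-probability gap between
    # the current policy and the reference policy.
    penalty = (logp_new - logp_ref).sum()

    # Overall objective to maximize: clipped surrogate minus the weighted penalty.
    return l_clip - beta * penalty
```

Subtracting β times the penalty means that gains from the reward-driven term only pay off if the policy does not drift too far from the reference model.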

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Use of Proximal Policy Optimization (PPO) in RLHF
PPO Objective for LLM Training
PPO as an Online Reinforcement Learning Method
Overall PPO Objective Function for Language Models
An engineer is training a text-generation model using a reinforcement learning algorithm. They notice that the model's performance is highly unstable: after a few successful updates, a single large update often causes the model's output quality to degrade significantly. Which of the following mechanisms is specifically designed to prevent such large, destabilizing policy updates by limiting the magnitude of the change between the new and old policies at each step?
Analysis of PPO's Stabilization Components
An engineer is fine-tuning a large language model using a reinforcement learning algorithm. The training objective is designed to maximize a reward score while also penalizing large deviations from the model's initial, trusted behavior. A specific hyperparameter, β, controls the strength of this penalty. The engineer sets β to a very high value. What is the most likely outcome of the training process?
Composite Objective for PPO-Clip
Your team is running RLHF for a customer-facing LL...
You’re running an RLHF fine-tuning job for an inte...
You are reviewing an RLHF training run for an inte...
Diagnosing Instability in an RLHF + PPO Training Run
Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
Choosing and Justifying an RLHF Objective Under Competing Product Constraints
Designing an RLHF Training Blueprint for a Regulated Customer-Support LLM
Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses
Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions
Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO
PPO Objective Formula for LLM Training in RLHF
Overall PPO Objective Function for Language Models
During the training of a language model, the policy is updated based on a clipped objective function. Consider a single token generation step where the ratio of the current policy's probability to the reference policy's probability for a specific token is very large (e.g., 3.0), and the estimated advantage for generating this token is highly positive. The clipping range is set to [0.8, 1.2]. How does the clipping mechanism influence the calculation of the objective for this specific token?
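As a quick arithmetic check for this scenario (assuming the standard PPO-clip form $\min\bigl(r\,A,\ \operatorname{clip}(r,\,0.8,\,1.2)\,A\bigr)$ with a positive advantage $A$):

$$\min\bigl(3.0\,A,\ \operatorname{clip}(3.0,\,0.8,\,1.2)\,A\bigr) = \min(3.0\,A,\ 1.2\,A) = 1.2\,A$$

so this token's contribution is capped at $1.2\,A$: pushing the ratio further above the clipping range yields no additional objective and no additional gradient signal.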
Policy Update Analysis with Negative Advantage
Analysis of Clipping Mechanism based on Advantage Sign
Overall PPO Objective Function for Language Models
During the policy optimization phase of training a large language model, the model is being rewarded for providing detailed explanations. The 'reference policy' is a version of the model that typically gives concise, direct answers. The current policy generates two possible responses to a user's query:
Response A: 'Yes.' Response B: 'Affirmative, the data you have presented aligns with the expected parameters, and therefore, the conclusion you have reached is indeed correct and validated.'
Assuming the reference policy would have a very high probability of generating Response A and a near-zero probability of generating Response B, which response would incur a larger penalty term designed to prevent deviation from the reference policy, and why?
Consequences of Policy Regularization Strength
Analysis of the Policy Regularization Penalty
An autoregressive language model is generating the two-token response 'Good day' given a prompt. The table below shows the per-token log-probabilities from the current policy being trained ($\pi_{\theta}$) and a fixed reference policy ($\pi_{\text{ref}}$). The policy divergence penalty is calculated as the sum of the differences between the log-probabilities of the current and reference policies for each token.

| Token | $\log \pi_{\theta}$ | $\log \pi_{\text{ref}}$ |
| :--- | :---: | :---: |
| 'Good' | -0.8 | -1.5 |
| 'day' | -0.4 | -2.1 |
Based on this data, what can be concluded about the current policy's behavior for this specific generation?
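For reference, plugging the table's values into the per-token definition stated above:

$$\text{Penalty} = \bigl(-0.8 - (-1.5)\bigr) + \bigl(-0.4 - (-2.1)\bigr) = 0.7 + 1.7 = 2.4$$

A positive penalty of this size indicates that, for this generation, the current policy assigns substantially higher probability to both tokens than the reference policy does.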
Diagnosing Training Issues with Policy Divergence
Overall PPO Objective Function for Language Models
Interpreting the Policy Divergence Penalty
Learn After
A language model is being trained using an objective function that balances a reward-based component with a penalty for deviating from an initial reference policy. The penalty's influence is controlled by a coefficient, β. During training, developers observe that the model's outputs, while achieving high reward scores, are becoming increasingly repetitive and nonsensical. Which of the following adjustments to β is the most appropriate first step to mitigate this issue, and why?
Impact of Penalty Coefficient on LLM Fine-Tuning
Consequences of Modifying the PPO Objective Function