Formula

Overall PPO Objective Function for Language Models

The overall objective function for training language models with Proximal Policy Optimization (PPO), denoted as $U$, combines the clipped surrogate objective with a policy divergence penalty. This composite objective is formulated as:

$$U(\mathbf{x}, \mathbf{y}; \theta) = U_{\text{ppo-clip}}(\mathbf{x}, \mathbf{y}; \theta) - \beta \, \text{Penalty}$$

In this equation, $U_{\text{ppo-clip}}$ represents the PPO clipped objective, while the Penalty term quantifies the divergence from a reference policy. The hyperparameter $\beta$ is a coefficient that controls the magnitude of this penalty.

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences