Formula

Composite Objective for PPO-Clip

The PPO-Clip training method uses a composite objective that combines the clipped surrogate objective $U_{\text{clip}}$ with a policy divergence penalty:

$$U_{\text{ppo-clip}}(\tau; \theta) = U_{\text{clip}}(\tau; \theta) - \beta \cdot \text{Penalty}$$

Here the hyperparameter $\beta$ is the weight of the penalty term, controlling how strongly it influences the overall objective.
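The composite objective can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's implementation: the penalty is assumed to be precomputed (e.g., an estimated divergence between the current and reference policies), and the names `ratio`, `advantage`, `beta`, and `eps` are illustrative.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, penalty, beta=0.01, eps=0.2):
    """Composite PPO-Clip objective: clipped surrogate minus a
    weighted policy-divergence penalty (illustrative sketch)."""
    # Clipped surrogate U_clip: elementwise minimum of the unclipped
    # and clipped importance-weighted advantages, averaged over samples
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    u_clip = np.mean(np.minimum(unclipped, clipped))
    # Subtract the weighted penalty term: U_clip - beta * Penalty
    return u_clip - beta * penalty
```

With `ratio = 1` (new and old policies agree) and `penalty = 0`, the objective reduces to the mean advantage; increasing `beta` trades surrogate reward against the divergence penalty.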


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences
