Concept

Incorporating Policy Divergence Penalty into the Clipped Surrogate Objective

A policy divergence penalty can be added to the clipped surrogate objective to form a new, composite objective. The penalty encourages the current policy to stay close to a reference policy, which limits large updates that could destabilize learning. The combined objective therefore constrains policy updates in two ways: by clipping and by penalizing divergence from the reference policy.
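As a rough illustration of how the two terms might be combined, the sketch below computes a per-sample clipped surrogate and subtracts a weighted divergence penalty. The function name, the clipping range `clip_eps`, the penalty weight `beta`, and the choice of a per-sample log-ratio as the divergence estimate are all assumptions made for this example; the source does not specify the form of the penalty or its weight.

```python
import math

def clipped_objective_with_penalty(logp_new, logp_old, logp_ref, advantages,
                                   clip_eps=0.2, beta=0.1):
    """Mean of (clipped surrogate - beta * divergence estimate) over samples.

    All names and default values here are hypothetical, chosen for illustration.
    """
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        # Probability ratio between the current policy and the policy
        # that generated the samples.
        ratio = math.exp(ln - lo)
        # Clipping keeps the ratio within [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Clipped surrogate: the more pessimistic of the two terms.
        surrogate = min(ratio * adv, clipped_ratio * adv)
        # Assumed penalty: per-sample log-ratio against the reference policy,
        # used here as a simple estimate of the divergence.
        penalty = ln - lr
        total += surrogate - beta * penalty
    return total / len(advantages)

# Example: the current policy has moved away from the reference policy,
# so the penalty lowers the objective below the plain surrogate value of 1.0.
value = clipped_objective_with_penalty(
    logp_new=[math.log(0.5)],
    logp_old=[math.log(0.5)],
    logp_ref=[math.log(0.25)],
    advantages=[1.0],
)
```

In this sketch the objective is maximized, so a larger `beta` pulls the current policy more strongly toward the reference policy, while `clip_eps` independently caps how much any single sample's ratio can contribute.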


Updated 2025-10-06


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences