Formula

Log-Probability Difference as a Policy Divergence Penalty

A simple way to implement a penalty for policy divergence in trust region optimization is to calculate the difference between the log-probabilities of a trajectory $\tau$ under the current policy $\pi_{\theta}$ and a reference policy $\pi_{\theta_{\text{ref}}}$, using the following formula:

$$\text{Penalty} = \log \pi_{\theta}(\tau) - \log \pi_{\theta_{\text{ref}}}(\tau)$$

This value provides a straightforward measure of how far the current policy has deviated from the reference: it is positive when $\pi_{\theta}$ assigns higher probability to $\tau$ than $\pi_{\theta_{\text{ref}}}$ does, and its expectation over trajectories sampled from $\pi_{\theta}$ is the KL divergence $\mathrm{KL}(\pi_{\theta} \,\|\, \pi_{\theta_{\text{ref}}})$.
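As a minimal sketch, assuming per-token log-probabilities of each trajectory are already available from both policies (the function name, tensor shapes, and sample values below are illustrative, not part of the original formula), the penalty reduces to a summed difference:

```python
import torch

def log_prob_penalty(logp_current: torch.Tensor,
                     logp_ref: torch.Tensor) -> torch.Tensor:
    """Log-probability difference penalty for a batch of trajectories.

    Args:
        logp_current: per-token log-probs under the current policy
            pi_theta, shape (batch, seq_len).
        logp_ref: per-token log-probs under the frozen reference policy
            pi_theta_ref, same shape.

    Returns:
        Per-trajectory penalty log pi_theta(tau) - log pi_theta_ref(tau),
        shape (batch,).
    """
    # log pi(tau) factorizes into a sum of per-token log-probs, so the
    # penalty is the per-token difference summed over the sequence.
    return (logp_current - logp_ref).sum(dim=-1)

# Hypothetical example: two trajectories of five tokens each.
logp_current = torch.log(torch.rand(2, 5))
logp_ref = torch.log(torch.rand(2, 5))
print(log_prob_penalty(logp_current, logp_ref))  # tensor of shape (2,)
```

Because $\log \pi(\tau)$ factorizes into a sum of per-token log-probabilities, subtracting token-wise and then summing is equivalent to subtracting the two trajectory log-probabilities directly.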
