Learn Before
Log-Probability Difference as a Policy Divergence Penalty
A simple way to implement a penalty for policy divergence in trust region optimization is to calculate the difference between the log-probabilities of a trajectory, τ. This penalty measures the change between the current policy, π_θ, and a reference policy, π_ref, using the following formula:

log π_θ(τ) - log π_ref(τ)
This value provides a straightforward measure of how much the policy has deviated.
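As a concrete illustration, here is a minimal Python sketch of this penalty. It assumes the trajectory's log-probability under each policy is available as a list of per-step log-probabilities; the function name and the numbers are hypothetical.

```python
import math

def logprob_divergence_penalty(logp_current, logp_reference):
    """Per-trajectory penalty: log pi_theta(tau) - log pi_ref(tau).

    Each argument is a list of per-step log-probabilities the policy
    assigns along the same trajectory tau; the trajectory's total
    log-probability is their sum.
    """
    return sum(logp_current) - sum(logp_reference)

# Hypothetical numbers: the current policy assigns higher probability
# to every step of tau than the reference policy does.
current = [math.log(0.5), math.log(0.4)]
reference = [math.log(0.4), math.log(0.3)]
print(logprob_divergence_penalty(current, reference))  # positive: policy shifted toward tau
```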

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Log-Probability Difference as a Policy Divergence Penalty
An engineer is training a policy model and wants to prevent large, destabilizing updates between training iterations. They modify their original objective function, J(θ), to a new objective function, J_new(θ) = J(θ) - β * D(θ, θ_old), where θ represents the current policy parameters, θ_old represents the parameters from the previous iteration, D is a function that measures the divergence between the two sets of parameters (a larger value means more divergence), and β is a positive coefficient. During optimization, the goal is to maximize J_new(θ). What is the primary effect of the - β * D(θ, θ_old) term on the training process?
Stabilizing Reinforcement Learning Training
Choosing an Objective Function for Stable Policy Updates
Stabilizing Policy Updates with a Divergence Penalty
When implementing a penalty-based trust region for policy optimization where the goal is to maximize the objective function, increasing the weight of the penalty term will shrink the trusted area, restricting the policy to smaller updates (see the sketch below).
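A minimal scalar sketch of the penalized objective from the question above, with hypothetical numbers: it shows both why maximizing J_new(θ) discourages large updates and why a larger β shrinks the trusted region.

```python
def penalized_objective(j_value, divergence, beta):
    """J_new(theta) = J(theta) - beta * D(theta, theta_old), as scalars.

    j_value stands in for the original objective J(theta), divergence
    for D(theta, theta_old), and beta > 0 weights the penalty.
    """
    return j_value - beta * divergence

# Hypothetical candidates: an aggressive update with a higher raw J but a
# large divergence, and a conservative update that stays near theta_old.
print(penalized_objective(j_value=1.2, divergence=0.8, beta=2.0))  # 1.2 - 1.6 = -0.4
print(penalized_objective(j_value=1.0, divergence=0.1, beta=2.0))  # 1.0 - 0.2 =  0.8
# Maximizing J_new prefers the conservative update: the penalty term
# discourages large, destabilizing jumps away from theta_old.

# For a fixed raw improvement in J, the penalty cancels the gain once the
# divergence exceeds improvement / beta, so raising beta shrinks the
# region of updates that still look worthwhile.
improvement = 0.5
for beta in (0.5, 1.0, 2.0, 4.0):
    print(beta, improvement / beta)  # tolerated divergence falls as beta grows
```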
Learn After
Approximation of the Policy Divergence Penalty
Policy Divergence Penalty for Language Models
In a policy optimization process, a penalty is used to measure the change between a current policy, π_θ, and a reference policy, π_ref. The penalty is calculated for a specific sequence of actions and states (a trajectory, τ) using the formula:

log π_θ(τ) - log π_ref(τ)

If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation? (A numeric sketch of this case follows the list below.)
Calculating Policy Divergence Penalty
Interpreting Policy Divergence
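As a worked illustration of the interpretation question above, here is a small numeric sketch with hypothetical probabilities: a large positive penalty means the current policy now assigns the trajectory far more probability than the reference policy does.

```python
import math

# Hypothetical trajectory probabilities under the two policies.
p_current = 0.6     # pi_theta(tau)
p_reference = 0.01  # pi_ref(tau)

penalty = math.log(p_current) - math.log(p_reference)
print(penalty)  # ~4.09: the current policy has drifted strongly toward tau
```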