Formula

Policy Probability Ratio Greater Than One

The inequality $\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t \mid s_t)} > 1$ expresses the condition that the probability of selecting action $a_t$ in state $s_t$ under the current policy $\pi_{\theta}$ exceeds its probability under a reference policy $\pi_{\theta_{\mathrm{ref}}}$. When the inequality holds, the current policy is more likely to choose $a_t$ than the reference policy is. This ratio is a fundamental component of policy-optimization methods in reinforcement learning, such as PPO, where it serves as an importance weight measuring how far the updated policy $\pi_{\theta}$ has shifted relative to a baseline or a previous iteration of the policy.
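In practice the ratio is usually computed from log-probabilities for numerical stability, since $\frac{\pi_{\theta}}{\pi_{\theta_{\mathrm{ref}}}} = \exp(\log \pi_{\theta} - \log \pi_{\theta_{\mathrm{ref}}})$. Below is a minimal sketch in PyTorch; the function name `probability_ratio` and the example log-probability values are illustrative assumptions, not drawn from any particular library or the course material.

```python
import torch

def probability_ratio(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-action ratio pi_theta(a_t|s_t) / pi_ref(a_t|s_t).

    Computed as exp(log pi_theta - log pi_ref) rather than dividing raw
    probabilities, which avoids underflow for small probabilities.
    """
    return torch.exp(logp_theta - logp_ref)

# Hypothetical log-probabilities of three chosen actions under each policy.
logp_theta = torch.tensor([-1.2, -0.7, -2.3])  # current policy pi_theta
logp_ref   = torch.tensor([-1.5, -0.9, -2.0])  # reference policy pi_ref

ratio = probability_ratio(logp_theta, logp_ref)
print(ratio)      # tensor([1.3499, 1.2214, 0.7408])
print(ratio > 1)  # tensor([ True,  True, False])
```

In this example the first two actions have become more likely under the current policy (ratio > 1), while the third has become less likely, which is exactly the condition the inequality tests.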

