Formula

Policy Probability Ratio Less Than One

The condition $\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t|s_t)} < 1$ signifies that, in state $s_t$, the action $a_t$ is less favored by the current policy $\pi_{\theta}$ than by the reference policy $\pi_{\theta_{\mathrm{ref}}}$. In other words, the current policy assigns a lower probability to that action than the reference policy does, so it is less likely to choose it.
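As a minimal sketch of how this condition can be checked numerically, the snippet below computes the ratio from log-probabilities, which is how policy probabilities are commonly stored in practice. The function name `probability_ratio` and the probability values 0.2 and 0.5 are hypothetical, chosen only for illustration; they do not come from the source.

```python
import math


def probability_ratio(logp_current: float, logp_ref: float) -> float:
    """Return pi_theta(a_t|s_t) / pi_theta_ref(a_t|s_t) from log-probabilities.

    Subtracting in log space and exponentiating once avoids dividing
    two very small probabilities directly.
    """
    return math.exp(logp_current - logp_ref)


# Hypothetical probabilities for one action a_t in one state s_t.
p_current = 0.2  # assumed probability under the current policy
p_ref = 0.5      # assumed probability under the reference policy

ratio = probability_ratio(math.log(p_current), math.log(p_ref))

# ratio < 1 holds exactly when the current log-probability is the smaller one.
is_less_favored = ratio < 1.0
```

With these assumed values the ratio is below one, so the action counts as less favored by the current policy. Swapping the two probabilities would give a ratio above one.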

Updated 2026-05-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences