Formula

Policy Probability Ratio (Ratio Function)

The policy probability ratio, also known as the ratio function, compares a current policy ($\pi_{\theta}$) with a previous or reference policy ($\pi_{\theta_{\mathrm{ref}}}$) for a given state-action pair. It is computed by dividing the probability of an action under the current policy by its probability under the reference policy. A ratio above 1 means the current policy assigns the action a higher probability than the reference policy does; a ratio below 1 means it assigns a lower one. The ratio is used to reweight observed rewards according to how likely each action is under the current policy relative to the reference policy. The formula is: $\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t|s_t)}$.
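As a rough illustration of the definition above, the following Python sketch computes the ratio for a single state-action pair. The function name `probability_ratio`, the probabilities 0.30 and 0.20, and the reward value 2.0 are all hypothetical and chosen only for the example. Working from log-probabilities is an assumption made here because policies over long action sequences are usually scored in log space; it is not something the definition itself requires.

```python
import math


def probability_ratio(logp_current: float, logp_ref: float) -> float:
    """Return pi_theta(a_t|s_t) / pi_theta_ref(a_t|s_t).

    Both arguments are log-probabilities of the same action in the same
    state, so the ratio is exp(log p_current - log p_ref).
    """
    return math.exp(logp_current - logp_ref)


# Hypothetical probabilities of one action a_t in state s_t.
p_current = 0.30  # probability under the current policy (illustrative)
p_ref = 0.20      # probability under the reference policy (illustrative)

ratio = probability_ratio(math.log(p_current), math.log(p_ref))

# Reweight a hypothetical observed reward by the ratio.
reward = 2.0
reweighted_reward = ratio * reward

print(f"ratio = {ratio:.3f}, reweighted reward = {reweighted_reward:.3f}")
```

With these illustrative numbers the ratio is approximately 1.5, so the action's reward counts for more because the current policy favors the action more strongly than the reference policy does.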

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences