1Cademy - Policy Probability Ratio (Ratio Function)

Agent 1: When in state 'S', it is programmed to always choose the action 'move North'.
Agent 2: When in state 'S', it is programmed to choose 'move North' with 70% probability and 'move East' with 30% probability.

Learn Before

Policy in Reinforcement Learning ( $\pi$ )

Formula

Policy Probability Ratio (Ratio Function)

The policy probability ratio, also known as the ratio function, measures how far a current policy ( $\pi_{\theta}$ ) has moved from a previous or reference policy ( $\pi_{\theta_{\mathrm{ref}}}$ ) for a given state-action pair $(s_t, a_t)$. It is computed by dividing the probability of the action under the current policy by its probability under the reference policy: $\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t|s_t)}$ . A ratio of 1 means both policies assign the action the same probability; a ratio above 1 means the current policy makes the action more likely, and a ratio below 1 means it makes the action less likely. With the ratio function, observed rewards can be reweighted according to how likely the actions are under the current policy relative to the reference policy.
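
As a rough illustration, the ratio can be computed directly from two action distributions for the same state. The sketch below is not part of the original node. It assumes, purely for the example, that a policy which always moves North in state 'S' plays the role of the current policy and that a policy choosing North with probability 0.7 and East with probability 0.3 plays the role of the reference policy. The function names, the dictionary representation, and the reward value of 10.0 are all hypothetical.

```python
import math

# Assumed roles (not stated in the source): the deterministic policy is the
# current policy pi_theta, the 70/30 policy is the reference pi_theta_ref.
current_policy = {"North": 1.0, "East": 0.0}
reference_policy = {"North": 0.7, "East": 0.3}


def policy_ratio(current, reference, action):
    """Return pi_theta(a|s) / pi_theta_ref(a|s) for one action in a fixed state."""
    ref_prob = reference[action]
    if ref_prob == 0.0:
        # The ratio is undefined when the reference policy never takes the action.
        raise ValueError("reference policy assigns zero probability to this action")
    return current[action] / ref_prob


def policy_ratio_from_log_probs(log_prob_current, log_prob_reference):
    """Same ratio computed from log-probabilities: exp(log p - log p_ref)."""
    return math.exp(log_prob_current - log_prob_reference)


ratio_north = policy_ratio(current_policy, reference_policy, "North")  # 1.0 / 0.7
ratio_east = policy_ratio(current_policy, reference_policy, "East")    # 0.0 / 0.3

# Reweighting a hypothetical observed reward by the ratio.
observed_reward = 10.0
reweighted_reward = ratio_north * observed_reward

print(f"ratio for North: {ratio_north:.4f}")
print(f"ratio for East:  {ratio_east:.4f}")
print(f"reweighted reward for North: {reweighted_reward:.4f}")
```

Under these assumptions the ratio for North is about 1.43, because the current policy takes that action more often than the reference policy does, and the ratio for East is 0, because the current policy never takes it. The log-probability variant is included because implementations often store log-probabilities rather than raw probabilities; whether that applies depends on the setup.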