Formula

Off-Policy Objective Function with Importance Sampling

In off-policy reinforcement learning, the performance of a target policy $\pi_{\theta}$ can be evaluated using data generated by a different, reference policy $\pi_{\theta_{\text{ref}}}$. This is achieved by reformulating the objective function $J(\theta)$ with importance sampling: the objective becomes the expected cumulative reward $R(\tau)$ under the reference policy, with each reward weighted by the ratio of trajectory probabilities between the target and reference policies:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}} \left[ \frac{\Pr_{\theta}(\tau)}{\Pr_{\theta_{\text{ref}}}(\tau)} R(\tau) \right]$$

The term $\frac{\Pr_{\theta}(\tau)}{\Pr_{\theta_{\text{ref}}}(\tau)}$ is the importance sampling weight, which corrects for the discrepancy between the policy that collected the data and the policy being evaluated.
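As a concrete illustration, below is a minimal Monte Carlo sketch of this estimator in Python. It assumes rollouts have already been collected under $\pi_{\theta_{\text{ref}}}$ and that per-step action log-probabilities under both policies are available; the names (`logp_target`, `logp_ref`, `returns`) are hypothetical placeholders, not part of the source. It uses the standard fact that a trajectory's probability factors into environment dynamics and policy terms, so the dynamics cancel in the ratio and the weight reduces to a product of per-step action-probability ratios.

```python
import numpy as np

def off_policy_objective(logp_target, logp_ref, returns):
    """Monte Carlo estimate of J(theta) = E_{tau ~ pi_ref}[w(tau) * R(tau)].

    Because Pr(tau) = p(s_0) * prod_t p(s_{t+1}|s_t, a_t) * pi(a_t|s_t),
    the dynamics terms cancel in the ratio, leaving
        w(tau) = prod_t pi_theta(a_t|s_t) / pi_ref(a_t|s_t).

    logp_target, logp_ref : arrays of shape (num_trajectories, horizon)
        per-step action log-probabilities under each policy (assumed given).
    returns : array of shape (num_trajectories,)
        cumulative reward R(tau) for each trajectory.
    """
    # Sum the per-step log-ratios over time, then exponentiate:
    # the log of a product is the sum of logs.
    log_weights = (logp_target - logp_ref).sum(axis=1)
    weights = np.exp(log_weights)
    # Average the importance-weighted returns over trajectories from pi_ref.
    return np.mean(weights * returns)

# Toy usage with random numbers standing in for real rollout data.
rng = np.random.default_rng(0)
logp_ref = np.log(rng.uniform(0.1, 1.0, size=(64, 10)))
logp_target = logp_ref + rng.normal(0.0, 0.05, size=(64, 10))
returns = rng.normal(1.0, 0.5, size=64)
print(off_policy_objective(logp_target, logp_ref, returns))
```

In practice the weights are usually kept in log space (or clipped per step, as in PPO-style objectives) because the product of many ratios can have very high variance; the sketch above exponentiates only for clarity.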
