Formula

Off-Policy Objective Function with Importance Sampling

In off-policy reinforcement learning, the performance of a target policy $\pi_{\theta}$ can be evaluated using data generated by a different, reference policy $\pi_{\theta_{\text{ref}}}$. This is achieved by reformulating the objective function $J(\theta)$ with importance sampling: the objective becomes the expected cumulative reward $R(\tau)$ under the reference policy, with each reward weighted by the ratio of trajectory probabilities between the target and reference policies:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}} \left[ \frac{\Pr_{\theta}(\tau)}{\Pr_{\theta_{\text{ref}}}(\tau)} R(\tau) \right]$$

The term $\frac{\Pr_{\theta}(\tau)}{\Pr_{\theta_{\text{ref}}}(\tau)}$ is the importance sampling weight, which corrects for the discrepancy between the policy that collected the data and the policy being evaluated.
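As a concrete illustration, below is a minimal Monte Carlo sketch of this estimator in Python. It assumes rollouts have already been collected under $\pi_{\theta_{\text{ref}}}$ and that per-step action log-probabilities under both policies are available; the names (`logp_target`, `logp_ref`, `returns`) are hypothetical placeholders, not part of the source. It uses the standard fact that a trajectory's probability factors into environment dynamics and policy terms, so the dynamics cancel in the ratio and the weight reduces to a product of per-step action-probability ratios.

```python
import numpy as np

def off_policy_objective(logp_target, logp_ref, returns):
    """Monte Carlo estimate of J(theta) = E_{tau ~ pi_ref}[w(tau) * R(tau)].

    Because Pr(tau) = p(s_0) * prod_t p(s_{t+1}|s_t, a_t) * pi(a_t|s_t),
    the dynamics terms cancel in the ratio, leaving
        w(tau) = prod_t pi_theta(a_t|s_t) / pi_ref(a_t|s_t).

    logp_target, logp_ref : arrays of shape (num_trajectories, horizon)
        per-step action log-probabilities under each policy (assumed given).
    returns : array of shape (num_trajectories,)
        cumulative reward R(tau) for each trajectory.
    """
    # Sum the per-step log-ratios over time, then exponentiate:
    # the log of a product is the sum of logs.
    log_weights = (logp_target - logp_ref).sum(axis=1)
    weights = np.exp(log_weights)
    # Average the importance-weighted returns over trajectories from pi_ref.
    return np.mean(weights * returns)

# Toy usage with random numbers standing in for real rollout data.
rng = np.random.default_rng(0)
logp_ref = np.log(rng.uniform(0.1, 1.0, size=(64, 10)))
logp_target = logp_ref + rng.normal(0.0, 0.05, size=(64, 10))
returns = rng.normal(1.0, 0.5, size=64)
print(off_policy_objective(logp_target, logp_ref, returns))
```

In practice the weights are usually kept in log space (or clipped per step, as in PPO-style objectives) because the product of many ratios can have very high variance; the sketch above exponentiates only for clarity.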
