Multiple Choice

In a reinforcement learning scenario, the performance of a new policy, defined by parameters θ, is often estimated using an objective function that relies on data collected from a reference policy, defined by parameters θ_ref. This objective function is given by

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}} \left[ \frac{\text{Pr}_{\theta}(\tau)}{\text{Pr}_{\theta_{\text{ref}}}(\tau)} R(\tau) \right]$$

where τ denotes a trajectory, Pr(τ) is the probability of that trajectory under the given policy, and R(τ) is its total reward. Which of the following statements most accurately describes the relationship between this objective function, $J(\theta)$, and the true expected reward of the reference policy, $\mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}[R(\tau)]$?
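The key property of this importance-sampled objective can be checked numerically. A minimal sketch, using a hypothetical discrete space of three "trajectories" with made-up probabilities and rewards, shows that $J(\theta)$ equals the expected reward under the new policy $\pi_\theta$, not under the reference policy:

```python
# Hypothetical toy setup: three possible "trajectories" with fixed rewards.
# All numbers below are illustrative assumptions, not from the question.
trajectories = [0, 1, 2]
rewards = {0: 1.0, 1: 5.0, 2: 10.0}

# Trajectory probabilities under the new policy (theta) and the
# reference policy (theta_ref); each sums to 1.
p_theta = {0: 0.1, 1: 0.3, 2: 0.6}
p_ref   = {0: 0.5, 1: 0.3, 2: 0.2}

# Exact expected rewards under each policy.
exp_reward_theta = sum(p_theta[t] * rewards[t] for t in trajectories)
exp_reward_ref   = sum(p_ref[t]   * rewards[t] for t in trajectories)

# J(theta): the importance-weighted expectation, taken under the
# reference policy's trajectory distribution.
J = sum(p_ref[t] * (p_theta[t] / p_ref[t]) * rewards[t] for t in trajectories)

print(J, exp_reward_theta, exp_reward_ref)
```

Since the ratio $\text{Pr}_\theta(\tau)/\text{Pr}_{\theta_{\text{ref}}}(\tau)$ cancels the reference probability inside the expectation, `J` matches `exp_reward_theta` exactly, while `exp_reward_ref` differs whenever the two policies assign different probabilities to rewarding trajectories.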


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
