Multiple Choice

In reinforcement learning, the standard on-policy objective is the expected reward under the current policy π_θ: J_on-policy(θ) = E_{τ ~ π_θ}[R(τ)]. An alternative "surrogate" objective uses importance sampling to evaluate π_θ with trajectories drawn from a reference policy π_θ_ref: J_surrogate(θ) = E_{τ ~ π_θ_ref}[(Pr_θ(τ) / Pr_θ_ref(τ)) · R(τ)], where Pr_θ(τ) denotes the probability of trajectory τ under π_θ. What is the key mathematical step that demonstrates that J_surrogate(θ) is exactly equivalent to J_on-policy(θ)?
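For orientation, the equivalence the question targets follows from a single change of measure: write the surrogate expectation explicitly as a sum (or integral) over trajectories, and the importance ratio cancels the sampling distribution. In the question's own notation:

J_surrogate(θ) = Σ_τ Pr_θ_ref(τ) · (Pr_θ(τ) / Pr_θ_ref(τ)) · R(τ) = Σ_τ Pr_θ(τ) · R(τ) = E_{τ ~ π_θ}[R(τ)] = J_on-policy(θ)

The cancellation is valid provided Pr_θ_ref(τ) > 0 wherever Pr_θ(τ) · R(τ) ≠ 0 (support coverage / absolute continuity).

The identity can also be checked numerically. The sketch below uses a made-up three-outcome trajectory space; the probabilities and rewards are hypothetical toy values, and only the reweighting step itself comes from the question.

import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): 3 discrete "trajectories" with fixed rewards.
rewards = np.array([1.0, 0.0, 2.0])
p_theta = np.array([0.2, 0.5, 0.3])  # trajectory probs under the current policy pi_theta
p_ref = np.array([0.4, 0.4, 0.2])    # trajectory probs under the reference policy pi_theta_ref

n = 200_000

# On-policy estimate: sample trajectories from pi_theta, average R(tau).
tau = rng.choice(3, size=n, p=p_theta)
j_on_policy = rewards[tau].mean()

# Surrogate estimate: sample from pi_theta_ref, reweight by Pr_theta / Pr_theta_ref.
tau_ref = rng.choice(3, size=n, p=p_ref)
j_surrogate = ((p_theta[tau_ref] / p_ref[tau_ref]) * rewards[tau_ref]).mean()

print(j_on_policy, j_surrogate)

Both estimates converge to the exact value Σ_τ Pr_θ(τ) · R(τ) = 0.2·1.0 + 0.5·0.0 + 0.3·2.0 = 0.8.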


Updated 2025-09-28


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science