Multiple Choice

In reinforcement learning, the standard on-policy objective is the expected reward under the current policy π_θ: J_on-policy(θ) = E_{τ ~ π_θ}[R(τ)]. An alternative "surrogate" objective uses importance sampling to evaluate π_θ with trajectories drawn from a reference policy π_θ_ref: J_surrogate(θ) = E_{τ ~ π_θ_ref}[(Pr_θ(τ) / Pr_θ_ref(τ)) · R(τ)], where Pr_θ(τ) denotes the probability of trajectory τ under π_θ. What is the key mathematical step that demonstrates that J_surrogate(θ) is exactly equivalent to J_on-policy(θ)?
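For orientation, the equivalence the question targets follows from a single change of measure: write the surrogate expectation explicitly as a sum (or integral) over trajectories, and the importance ratio cancels the sampling distribution. In the question's own notation:

J_surrogate(θ) = Σ_τ Pr_θ_ref(τ) · (Pr_θ(τ) / Pr_θ_ref(τ)) · R(τ) = Σ_τ Pr_θ(τ) · R(τ) = E_{τ ~ π_θ}[R(τ)] = J_on-policy(θ)

The cancellation is valid provided Pr_θ_ref(τ) > 0 wherever Pr_θ(τ) · R(τ) ≠ 0 (support coverage / absolute continuity).

The identity can also be checked numerically. The sketch below uses a made-up three-outcome trajectory space; the probabilities and rewards are hypothetical toy values, and only the reweighting step itself comes from the question.

import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): 3 discrete "trajectories" with fixed rewards.
rewards = np.array([1.0, 0.0, 2.0])
p_theta = np.array([0.2, 0.5, 0.3])  # trajectory probs under the current policy pi_theta
p_ref = np.array([0.4, 0.4, 0.2])    # trajectory probs under the reference policy pi_theta_ref

n = 200_000

# On-policy estimate: sample trajectories from pi_theta, average R(tau).
tau = rng.choice(3, size=n, p=p_theta)
j_on_policy = rewards[tau].mean()

# Surrogate estimate: sample from pi_theta_ref, reweight by Pr_theta / Pr_theta_ref.
tau_ref = rng.choice(3, size=n, p=p_ref)
j_surrogate = ((p_theta[tau_ref] / p_ref[tau_ref]) * rewards[tau_ref]).mean()

print(j_on_policy, j_surrogate)

Both estimates converge to the exact value Σ_τ Pr_θ(τ) · R(τ) = 0.2·1.0 + 0.5·0.0 + 0.3·2.0 = 0.8.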


Updated 2025-09-28


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science