The following steps demonstrate that the surrogate objective, which uses importance sampling, is equivalent to the standard on-policy objective in reinforcement learning. Arrange these mathematical steps in the correct logical order to form the complete derivation.
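A minimal sketch of the derivation in its correct order, written for a discrete trajectory space (sums; replace the sums with integrals for continuous τ). The only assumption is that Pr_θ_ref(τ) > 0 wherever Pr_θ(τ) > 0, so the ratio is well defined:

```latex
\begin{aligned}
J_{\text{surrogate}}(\theta)
&= \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\!\left[\frac{\Pr_{\theta}(\tau)}{\Pr_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]
&& \text{(definition of the surrogate objective)} \\
&= \sum_{\tau} \Pr_{\theta_{\text{ref}}}(\tau)\,\frac{\Pr_{\theta}(\tau)}{\Pr_{\theta_{\text{ref}}}(\tau)}\, R(\tau)
&& \text{(expand the expectation as a sum)} \\
&= \sum_{\tau} \Pr_{\theta}(\tau)\, R(\tau)
&& \text{(the reference probabilities cancel)} \\
&= \mathbb{E}_{\tau \sim \pi_{\theta}}\![R(\tau)]
= J_{\text{on-policy}}(\theta)
&& \text{(definition of the on-policy objective)}
\end{aligned}
```

The key step is the cancellation: the sampling probability Pr_θ_ref(τ) multiplies the importance ratio Pr_θ(τ)/Pr_θ_ref(τ) and drops out exactly.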
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Comprehension in Revised Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In reinforcement learning, the standard on-policy objective is defined as the expected reward under the current policy π_θ: J_on-policy(θ) = E_{τ ~ π_θ} [R(τ)]. An alternative 'surrogate' objective uses importance sampling to evaluate the policy using data from a reference policy π_θ_ref: J_surrogate(θ) = E_{τ ~ π_θ_ref} [ (Pr_θ(τ) / Pr_θ_ref(τ)) * R(τ) ]. What is the key mathematical step that demonstrates that J_surrogate(θ) is exactly equivalent to J_on-policy(θ)?
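The cancellation can be checked numerically on a toy discrete trajectory space. Everything below (the three trajectories, the probabilities, the rewards) is an illustrative assumption, not part of the original question; the point is only that the two objectives agree exactly when computed as full sums:

```python
# Toy setup: three "trajectories" with probabilities under the current
# policy (p_theta), the reference policy (p_ref), and rewards R(tau).
# p_ref must be > 0 wherever p_theta > 0 for the ratio to be defined.
p_theta = [0.5, 0.3, 0.2]
p_ref   = [0.2, 0.4, 0.4]
rewards = [1.0, -2.0, 3.0]

# On-policy objective: E_{tau ~ p_theta}[R(tau)]
j_on_policy = sum(pt * r for pt, r in zip(p_theta, rewards))

# Surrogate objective: E_{tau ~ p_ref}[(p_theta / p_ref) * R(tau)]
j_surrogate = sum(pr * (pt / pr) * r
                  for pt, pr, r in zip(p_theta, p_ref, rewards))

print(j_on_policy, j_surrogate)  # identical: p_ref cancels term by term
```

In practice the surrogate is estimated from finitely many samples drawn from π_θ_ref, so the two quantities agree in expectation but the Monte Carlo estimate carries variance that grows with the mismatch between the two policies.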
Critique of Surrogate Objective Approximation