Critique of Surrogate Objective Approximation
A reinforcement learning practitioner argues, 'Using an importance-sampled surrogate objective to evaluate a new policy with old data is inherently an approximation. Because the data comes from a different policy, the result can't be exactly the same as the true on-policy performance.' Critique this statement. Is the practitioner's reasoning correct? Justify your answer by explaining the mathematical relationship between the surrogate and on-policy objectives.
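To see concretely why the importance-sampling identity is exact rather than approximate, it helps to compare the two objectives on a toy problem where every trajectory can be enumerated, so both expectations are computed exactly rather than estimated from samples. The sketch below is a minimal illustration only: the five-trajectory space, the rewards, and the probability vectors standing in for π_θ and π_θ_ref are made up for the example.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite trajectory space: 5 trajectories with arbitrary rewards R(tau).
num_trajectories = 5
reward = rng.normal(size=num_trajectories)

# Arbitrary probability vectors standing in for Pr_theta(tau) and Pr_theta_ref(tau).
p_theta = rng.random(num_trajectories)
p_theta /= p_theta.sum()
p_ref = rng.random(num_trajectories)
p_ref /= p_ref.sum()          # must be > 0 wherever p_theta is > 0

# On-policy objective J_on-policy(theta) = E_{tau ~ pi_theta}[R(tau)], as an exact sum.
j_on_policy = np.sum(p_theta * reward)

# Surrogate objective J_surrogate(theta) = E_{tau ~ pi_theta_ref}[(Pr_theta/Pr_theta_ref) * R(tau)],
# also computed as an exact sum over the same trajectory space.
j_surrogate = np.sum(p_ref * (p_theta / p_ref) * reward)

print(j_on_policy, j_surrogate)
assert np.isclose(j_on_policy, j_surrogate)   # equal up to floating-point rounding

Under these assumptions the two values agree to floating-point precision: the factor Pr_θ_ref(τ) from the sampling distribution cancels against the denominator of the importance weight. Any gap observed in practice comes from estimating the expectation with finitely many sampled trajectories, not from the surrogate objective itself.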
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In reinforcement learning, the standard on-policy objective is defined as the expected reward under the current policy π_θ: J_on-policy(θ) = E_{τ ~ π_θ} [R(τ)]. An alternative 'surrogate' objective uses importance sampling to evaluate the policy using data from a reference policy π_θ_ref: J_surrogate(θ) = E_{τ ~ π_θ_ref} [ (Pr_θ(τ) / Pr_θ_ref(τ)) * R(τ) ]. What is the key mathematical step that demonstrates that J_surrogate(θ) is exactly equivalent to J_on-policy(θ)?
The following steps demonstrate that the surrogate objective, which uses importance sampling, is equivalent to the standard on-policy objective in reinforcement learning. Arrange these mathematical steps in the correct logical order to form the complete derivation.
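A sketch of the derivation behind both of the related prompts above, written in the same notation as the definitions (and assuming Pr_θ_ref(τ) > 0 for every trajectory with Pr_θ(τ) > 0):

J_surrogate(θ) = E_{τ ~ π_θ_ref} [ (Pr_θ(τ) / Pr_θ_ref(τ)) * R(τ) ]
             = Σ_τ Pr_θ_ref(τ) * (Pr_θ(τ) / Pr_θ_ref(τ)) * R(τ)
             = Σ_τ Pr_θ(τ) * R(τ)
             = E_{τ ~ π_θ} [R(τ)]
             = J_on-policy(θ)

The key step is the second line: writing the expectation under π_θ_ref as an explicit sum over trajectories, after which Pr_θ_ref(τ) cancels with the denominator of the importance weight, leaving exactly the on-policy objective.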