Multiple Choice

In policy optimization, an objective function is often constructed using data from a fixed, older policy (the reference policy $\pi_{\text{ref}}$) to estimate the performance of a new policy $\pi_\theta$ being optimized. The objective weights each trajectory's reward by an importance sampling ratio:

$$ J(\theta) = \mathbb{E}_{\tau \sim \pi_{\text{ref}}} \left[ \frac{P_\theta(\tau)}{P_{\text{ref}}(\tau)} \, R(\tau) \right], $$

where $P_\theta(\tau)$ and $P_{\text{ref}}(\tau)$ are the probabilities of trajectory $\tau$ under the new and reference policies, and $R(\tau)$ is the trajectory's reward. A critical property of this objective is that its gradient, evaluated at the point where the new policy is identical to the reference policy ($\pi_\theta = \pi_{\text{ref}}$), is exactly the standard on-policy policy gradient. Which of the following statements provides the core mathematical justification for why this equivalence holds?
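A brief sketch of the derivation behind this property, in the notation above (the standard score-function argument, with $\nabla_\theta$ the gradient with respect to the new policy's parameters): since the sampling distribution $P_{\text{ref}}$ does not depend on $\theta$, the gradient passes inside the expectation,

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_{\text{ref}}} \left[ \frac{\nabla_\theta P_\theta(\tau)}{P_{\text{ref}}(\tau)} \, R(\tau) \right]. $$

Evaluating at the point $P_\theta = P_{\text{ref}}$ and applying the log-derivative identity $\nabla_\theta P_\theta(\tau) = P_\theta(\tau) \, \nabla_\theta \log P_\theta(\tau)$ gives

$$ \nabla_\theta J(\theta) \Big|_{\pi_\theta = \pi_{\text{ref}}} = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log P_\theta(\tau) \, R(\tau) \right], $$

which is exactly the on-policy (REINFORCE) policy gradient: at this point the importance ratio equals 1, and its gradient reduces to the score function.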


Tags

Ch.4 Alignment - Foundations of Large Language Models
