Equivalence of Surrogate and On-Policy Gradients at the Reference Point
A key property of the importance-sampled surrogate objective $J_{\text{surr}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]$ is that its gradient, when evaluated at the reference policy parameters $\theta = \theta_{\text{ref}}$, is identical to the standard on-policy policy gradient. This follows from moving the derivative inside the expectation, which is valid because the sampling distribution $\pi_{\theta_{\text{ref}}}$ does not depend on $\theta$: $\nabla_\theta J_{\text{surr}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]$. The term inside the expectation, $\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_{\text{ref}}}(\tau)}$, when evaluated at $\theta = \theta_{\text{ref}}$, equals $\nabla_\theta \log \pi_\theta(\tau)\big|_{\theta = \theta_{\text{ref}}}$, because $\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \pi_\theta(\tau) / \pi_\theta(\tau)$. The right-hand side is therefore $\mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right]$ evaluated at $\theta_{\text{ref}}$, which is precisely the on-policy policy gradient at the reference policy. This equivalence ensures that, at the beginning of each optimization step, the update direction given by the surrogate objective is the same as the true policy gradient.
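This property can be checked numerically. The sketch below is a minimal illustration, not part of the original text: it assumes a one-step bandit with three actions, a softmax policy over logits, fixed per-action rewards, and PyTorch for automatic differentiation; the names surrogate_grad and reinforce_grad are hypothetical. On a fixed batch of samples drawn from the reference policy, the Monte Carlo gradient of the importance-sampled surrogate coincides with the score-function (REINFORCE) gradient at $\theta = \theta_{\text{ref}}$, and generally differs after a parameter update.

```python
# Minimal sketch (assumed setup): 3-action bandit, softmax policy, PyTorch autograd.
import torch

torch.manual_seed(0)
R = torch.tensor([1.0, -2.0, 0.5])           # reward of each action (illustrative)
theta_ref = torch.tensor([0.2, -0.4, 0.1])   # reference policy parameters

# Sample a batch of one-step "trajectories" (single actions) from the reference policy.
pi_ref = torch.softmax(theta_ref, dim=0)
actions = torch.multinomial(pi_ref, num_samples=1000, replacement=True)
rewards = R[actions]

def surrogate_grad(theta):
    """Monte Carlo gradient of the surrogate E_ref[(pi_theta / pi_ref) * R]."""
    theta = theta.clone().requires_grad_(True)
    pi = torch.softmax(theta, dim=0)
    ratio = pi[actions] / pi_ref[actions]     # pi_ref is a constant w.r.t. theta
    (ratio * rewards).mean().backward()
    return theta.grad

def reinforce_grad(theta):
    """Score-function (REINFORCE) gradient estimate on the same sampled batch."""
    theta = theta.clone().requires_grad_(True)
    log_pi = torch.log_softmax(theta, dim=0)
    (log_pi[actions] * rewards).mean().backward()
    return theta.grad

# At theta = theta_ref the two gradient estimates coincide (up to float error).
print(torch.allclose(surrogate_grad(theta_ref), reinforce_grad(theta_ref), atol=1e-6))

# After a single gradient step they generally no longer match.
theta_new = theta_ref + 0.5 * surrogate_grad(theta_ref)
print(torch.allclose(surrogate_grad(theta_new), reinforce_grad(theta_new), atol=1e-6))
```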

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Equivalence of the Surrogate Objective and the On-Policy Objective
Surrogate Objective at the Policy Reference Point
Equivalence of Surrogate and On-Policy Gradients at the Reference Point
Training a Policy with Off-Distribution Data
A reinforcement learning agent is being updated. The current policy is denoted by $\pi_\theta$, and a batch of trajectory data has been collected using a previous, fixed policy, $\pi_{\theta_{\text{ref}}}$. To improve the current policy using this existing data, the following objective function is optimized: $J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]$. Which statement best analyzes the role of this objective function in the training process?
Rationale for Using a Surrogate Objective
Separation of Sampling and Reward Computation in Policy Learning
Variance in Surrogate Objective Gradient Estimates
Clipped Surrogate Objective Function
In a reinforcement learning scenario, an agent is in a particular state and has two possible actions, Action A and Action B. The agent's current parameterized policy assigns a non-zero probability to both actions. After sampling several trajectories, the agent estimates that the expected cumulative reward for taking Action A from this state is +10, while the expected cumulative reward for taking Action B from this state is -5. Based on the fundamental principle of updating a policy to maximize expected returns, how will the gradient update affect the probabilities of these actions?
Diagnosing Learning Issues in Policy Gradients
An agent's learning process involves updating its decision-making parameters (θ) based on experience. The update rule is proportional to the expression: Σ_s ρ(s) Σ_a ∇_θ π(s,a) Q(s,a). Match each mathematical component from this expression to its conceptual role in guiding the learning update.
Learn After
In policy optimization, an importance-sampled surrogate objective is often used to approximate the true on-policy objective. A key mathematical property of this surrogate is that its gradient, when evaluated at the reference policy (i.e., the policy used to collect the data), is identical to the true on-policy policy gradient. What is the most significant implication of this property for the training process?
In a policy optimization algorithm that uses an importance-sampled surrogate objective, a developer observes that the gradient of the surrogate objective is identical to the on-policy policy gradient at the start of an update step. However, after applying a single gradient update to the policy parameters, the two gradients are no longer identical. This divergence indicates a flaw in the algorithm's implementation.
In policy optimization, an objective function is often constructed using data from a fixed, older policy (the 'reference policy') to estimate the performance of a new policy being optimized. This objective uses an importance sampling ratio: $\mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]$, where $\pi_\theta$ is the new policy and $\pi_{\theta_{\text{ref}}}$ is the reference policy. A critical property of this objective is that its gradient, when evaluated at the point where the new policy is identical to the reference policy, is exactly equal to the standard on-policy policy gradient. Which of the following statements provides the core mathematical justification for why this equivalence holds?