Surrogate Objective in Reinforcement Learning
In reinforcement learning, a surrogate objective is an alternative objective function that is optimized in place of the true performance function. A common example is the off-policy objective, which uses importance sampling to evaluate the current policy $\pi_\theta$ with data collected from a reference policy $\pi_{\text{ref}}$: $\hat{J}(\theta) = \mathbb{E}_{\tau \sim \pi_{\text{ref}}}\!\left[\frac{\pi_\theta(\tau)}{\pi_{\text{ref}}(\tau)}\, R(\tau)\right]$. This formulation acts as a proxy, or surrogate, for the true on-policy objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$.
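The importance-sampling estimate above can be sketched in a few lines. This is a minimal illustration, not from the text; the two-arm setup, probabilities, and function names (`surrogate_estimate`, `prob_target`, `prob_ref`) are made-up assumptions.

```python
def surrogate_estimate(trajectories, prob_target, prob_ref, reward):
    """Monte Carlo estimate of the surrogate objective
    E_{tau ~ pi_ref}[ (pi_target(tau) / pi_ref(tau)) * R(tau) ]
    using trajectories that were sampled from the reference policy."""
    total = 0.0
    for tau in trajectories:
        weight = prob_target(tau) / prob_ref(tau)  # importance sampling ratio
        total += weight * reward(tau)
    return total / len(trajectories)

# Toy example: "trajectories" are single arm pulls (0 or 1).
# Reference policy picks arm 0 with prob 0.5; target picks it with prob 0.8.
prob_ref = lambda a: 0.5
prob_target = lambda a: 0.8 if a == 0 else 0.2
reward = lambda a: 1.0 if a == 0 else 0.0

batch = [0, 1, 0, 0]  # a small batch notionally drawn from the reference policy
estimate = surrogate_estimate(batch, prob_target, prob_ref, reward)
```

Each trajectory's reward is reweighted by the probability ratio, so data from the reference policy yields an (unbiased, if the batch is truly sampled from it) estimate of the target policy's performance.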
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Surrogate Objective in Reinforcement Learning
Equivalence of the Surrogate Objective and the On-Policy Objective
An agent's performance is being evaluated using a set of recorded experiences (trajectories) that were generated by an older, reference policy. The new, target policy being evaluated makes a specific high-reward trajectory significantly less probable than the reference policy did. How will the contribution of this specific high-reward trajectory be adjusted when estimating the performance of the new target policy?
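The adjustment in this scenario can be worked through numerically: the trajectory's reward is scaled by the ratio of its probability under the target policy to its probability under the reference policy. The probabilities and reward below are made-up illustration values, not from the text.

```python
def importance_weight(p_target, p_ref):
    """Importance sampling weight for one trajectory: the ratio of its
    probability under the target policy to that under the reference policy."""
    return p_target / p_ref

# A high-reward trajectory that the target policy makes much less
# probable than the reference policy did (hypothetical numbers).
p_ref = 0.20       # probability under the reference (behavior) policy
p_target = 0.02    # probability under the new target policy
reward = 10.0

w = importance_weight(p_target, p_ref)   # 0.1, well below 1
contribution = w * reward                # reward contribution scaled down to 1.0
```

Because the weight is below 1, the trajectory's reward is down-weighted in the estimate, reflecting how rarely the target policy would actually produce it.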
Off-Policy Performance Estimation
Consider an off-policy evaluation scenario where the performance of a 'target' policy is estimated using data collected from a 'reference' policy. If the target policy is identical to the reference policy, the importance sampling weight used to adjust the reward of every possible trajectory will be exactly 1.
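When the target and reference policies coincide, the probability ratio collapses to 1 for every trajectory, so the surrogate reduces to a plain on-policy average. A quick check, using made-up trajectory probabilities shared by both policies:

```python
# If target == reference, pi_target(tau) / pi_ref(tau) == 1 for every tau.
trajectory_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical, shared by both policies

weights = [p_target / p_ref
           for p_target, p_ref in zip(trajectory_probs, trajectory_probs)]
```

Every weight is exactly 1, so each trajectory's reward enters the estimate unadjusted.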
Learn After
Equivalence of the Surrogate Objective and the On-Policy Objective
Surrogate Objective at the Policy Reference Point
Equivalence of Surrogate and On-Policy Gradients at the Reference Point
Training a Policy with Off-Distribution Data
A reinforcement learning agent is being updated. The current policy is denoted by $\pi_\theta$, and a batch of trajectory data has been collected using a previous, fixed policy, $\pi_{\text{old}}$. To improve the current policy using this existing data, the following objective function is optimized: $\hat{J}(\theta) = \mathbb{E}_{\tau \sim \pi_{\text{old}}}\!\left[\frac{\pi_\theta(\tau)}{\pi_{\text{old}}(\tau)}\, R(\tau)\right]$. Which statement best analyzes the role of this objective function in the training process?
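Optimizing this objective can be sketched with a toy two-action bandit: actions are drawn once from the fixed old policy, and the same batch is then reused for several gradient steps on $\theta$. Everything below (the parameterization $\pi_\theta(0)=\theta$, the step size, the reward values) is an illustrative assumption, not from the text.

```python
def surrogate(theta, actions, pi_old, rewards):
    """Monte Carlo estimate of E_{a~pi_old}[(pi_theta(a)/pi_old(a)) * R(a)],
    where pi_theta(0) = theta and pi_theta(1) = 1 - theta."""
    total = 0.0
    for a in actions:
        p_theta = theta if a == 0 else 1.0 - theta
        total += (p_theta / pi_old[a]) * rewards[a]
    return total / len(actions)

def surrogate_grad(theta, actions, pi_old, rewards):
    """Gradient of the estimate above with respect to theta."""
    total = 0.0
    for a in actions:
        d_p_theta = 1.0 if a == 0 else -1.0  # d pi_theta(a) / d theta
        total += (d_p_theta / pi_old[a]) * rewards[a]
    return total / len(actions)

# Data collected once under pi_old, then reused for several updates.
pi_old = [0.5, 0.5]       # old policy's action probabilities
rewards = [1.0, 0.0]      # action 0 is the rewarding one
actions = [0, 1, 0, 1]    # batch notionally sampled from pi_old

theta = 0.5
for _ in range(3):        # ascend the surrogate without collecting new data
    step = 0.1 * surrogate_grad(theta, actions, pi_old, rewards)
    theta = min(0.99, max(0.01, theta + step))
```

The key point the question targets: the gradient is taken on the reweighted (surrogate) objective, which lets old data stand in for fresh on-policy samples, and here it pushes `theta` toward the rewarding action.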
Rationale for Using a Surrogate Objective
Separation of Sampling and Reward Computation in Policy Learning
Variance in Surrogate Objective Gradient Estimates
Clipped Surrogate Objective Function