Variance in Surrogate Objective Gradient Estimates
A primary challenge in using an unclipped surrogate objective for policy optimization is the high variance of its gradient estimates. Because the objective reweights data sampled from $\pi_{\theta_{\text{old}}}$ by the importance ratio $\pi_\theta(a \mid s) / \pi_{\theta_{\text{old}}}(a \mid s)$, these ratios can grow large as the current policy drifts away from the sampling policy, injecting significant noise into the parameter updates, destabilizing learning, and hindering reliable convergence.
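To make this concrete, the following is a minimal sketch (not from the course material) that Monte Carlo estimates the unclipped surrogate gradient on a toy softmax bandit and measures how its variance grows as $\pi_\theta$ drifts away from $\pi_{\theta_{\text{old}}}$. The action space, advantage values, and drift schedule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

n_actions = 3
advantages = np.array([1.0, -0.5, 0.2])  # assumed, fixed per-action advantages

theta_old = np.array([2.0, 0.0, 0.0])    # parameters of the sampling policy
pi_old = softmax(theta_old)

def surrogate_grad_samples(theta, n_samples=100_000):
    """Per-sample gradients of the unclipped surrogate
    L(theta) = E_{a ~ pi_old}[(pi_theta(a) / pi_old(a)) * A(a)],
    where each sample contributes ratio * grad log pi_theta(a) * A(a)."""
    pi = softmax(theta)
    actions = rng.choice(n_actions, size=n_samples, p=pi_old)
    ratios = pi[actions] / pi_old[actions]
    # For a softmax policy, grad log pi_theta(a) = one_hot(a) - pi.
    grad_log = np.eye(n_actions)[actions] - pi
    return ratios[:, None] * advantages[actions, None] * grad_log

# Drift the current policy toward an action that pi_old rarely samples:
# the importance ratios blow up, and the gradient-estimate variance with them.
for shift in [0.0, 1.5, 3.0]:
    theta = theta_old.copy()
    theta[1] += shift
    grads = surrogate_grad_samples(theta)
    ratio_drifted = softmax(theta)[1] / pi_old[1]
    print(f"shift={shift:.1f}  ratio on drifted action={ratio_drifted:.2f}  "
          f"total gradient variance={grads.var(axis=0).sum():.4f}")
```

Bounding exactly these ratios is what the clipped surrogate objective (see Related below) is designed to do.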
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Equivalence of the Surrogate Objective and the On-Policy Objective
Surrogate Objective at the Policy Reference Point
Equivalence of Surrogate and On-Policy Gradients at the Reference Point
Training a Policy with Off-Distribution Data
A reinforcement learning agent is being updated. The current policy is denoted by $\pi_\theta$, and a batch of trajectory data has been collected using a previous, fixed policy, $\pi_{\theta_{\text{old}}}$. To improve the current policy using this existing data, the following objective function is optimized: $\mathcal{L}(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}(s, a)\right]$. Which statement best analyzes the role of this objective function in the training process?
Rationale for Using a Surrogate Objective
Separation of Sampling and Reward Computation in Policy Learning
Clipped Surrogate Objective Function