Learn Before
Clipped Surrogate Objective Function
To address the high variance and resultant instability in policy gradient estimates, a clipped surrogate objective function is widely used. This objective incorporates a clipping mechanism to bound the importance weights, ensuring that individual policy updates do not become excessively large. The clipped utility function is formally defined as

$$L^{\mathrm{clip}}(\theta) = \mathbb{E}\left[\min\left(r(\theta)\,\hat{A},\; \mathrm{clip}\left(r(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}\right)\right], \qquad r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)},$$

where the clipping operation restricts the probability ratio $r(\theta)$ using a specified boundary hyperparameter $\epsilon$:

$$\mathrm{clip}\left(r(\theta),\, 1-\epsilon,\, 1+\epsilon\right) = \min\left(\max\left(r(\theta),\, 1-\epsilon\right),\, 1+\epsilon\right).$$
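To make the clipping mechanism concrete, here is a minimal NumPy sketch of the objective above; the function name, the batch values, and the default $\epsilon = 0.2$ are illustrative assumptions, not part of the original card.

```python
import numpy as np

def clipped_surrogate_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective averaged over a batch of sampled actions.

    logp_new   : log pi_theta(a|s) under the current policy
    logp_old   : log pi_theta_old(a|s) under the data-collecting policy
    advantages : advantage estimates A_hat for each sampled action
    epsilon    : clipping boundary hyperparameter
    """
    ratio = np.exp(logp_new - logp_old)                    # importance weight r(theta)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Pessimistic bound: take the elementwise minimum of the raw and clipped terms.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Synthetic batch (made-up numbers, for illustration only).
logp_new = np.log(np.array([0.30, 0.55, 0.10]))
logp_old = np.log(np.array([0.25, 0.60, 0.05]))
adv = np.array([1.0, -0.5, 2.0])
print(clipped_surrogate_objective(logp_new, logp_old, adv))
```

When this objective is maximized, samples whose ratio falls outside the clipping interval contribute a constant (zero-gradient) term, which is precisely how individual updates are kept from becoming excessively large.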
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Equivalence of the Surrogate Objective and the On-Policy Objective
Surrogate Objective at the Policy Reference Point
Equivalence of Surrogate and On-Policy Gradients at the Reference Point
Training a Policy with Off-Distribution Data
A reinforcement learning agent is being updated. The current policy is denoted by $\pi_\theta$, and a batch of trajectory data has been collected using a previous, fixed policy, $\pi_{\theta_{\mathrm{old}}}$. To improve the current policy using this existing data, the following objective function is optimized: $L(\theta) = \mathbb{E}_{a \sim \pi_{\theta_{\mathrm{old}}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, \hat{A}(s, a)\right]$. Which statement best analyzes the role of this objective function in the training process? (A concrete numeric illustration of this objective appears in the sketch after this list.)
Rationale for Using a Surrogate Objective
Separation of Sampling and Reward Computation in Policy Learning
Variance in Surrogate Objective Gradient Estimates
Clipped Surrogate Objective Function
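As a concrete illustration of the question above, the following NumPy sketch evaluates the importance-weighted surrogate objective for two candidate policies using one batch sampled under the fixed behavior policy; all numbers are synthetic and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# One synthetic batch collected under the fixed behavior policy pi_theta_old:
# per-action log-probabilities and advantage estimates.
logp_old = np.log(rng.uniform(0.05, 0.60, size=8))
advantages = rng.normal(size=8)

def surrogate(logp_new):
    """Importance-weighted surrogate objective for a candidate policy,
    estimated entirely from the batch sampled by pi_theta_old."""
    return np.mean(np.exp(logp_new - logp_old) * advantages)

# Score two different candidate updates of pi_theta against the SAME batch:
# no fresh sampling is needed between evaluations, which is the point of
# separating sampling (pi_theta_old) from optimization (pi_theta).
candidate_small = logp_old + rng.normal(scale=0.05, size=8)  # mild update
candidate_large = logp_old + rng.normal(scale=0.50, size=8)  # aggressive update
print(surrogate(candidate_small), surrogate(candidate_large))
```

Large perturbations tend to produce importance weights far from 1, which is exactly the variance problem that the clipped objective at the top of this card is designed to contain.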