Learn Before
Separation of Sampling and Reward Computation in Policy Learning
The formulation of the surrogate objective allows for a functional separation between the sequence sampling process and the reward computation. A baseline policy $\pi_{\theta_{\text{old}}}$ (parameterized by $\theta_{\text{old}}$) is used to sample a batch of sequences, and the target policy $\pi_{\theta}$ (parameterized by $\theta$) is then applied, via the importance-sampling ratio $\frac{\pi_{\theta}(y)}{\pi_{\theta_{\text{old}}}(y)}$, to compute the expected reward. The overall procedure therefore never needs to sample directly from the policy being evaluated. This decoupling is especially beneficial in reinforcement learning settings where generating trajectories from the target policy is computationally expensive or impractical.
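For concreteness, the sketch below illustrates this decoupling with a toy categorical policy in PyTorch rather than a full language model. All names here (`theta_old`, `theta`, `reward_fn`) and the reward definition are illustrative assumptions, not notation from the course; the point is only the two-step structure: sample once from the frozen baseline, then score those same samples under the target policy.

```python
import torch

# Minimal sketch of the sampling/reward split with toy categorical
# policies over a small action set (illustrative setup, not the book's).

torch.manual_seed(0)
NUM_ACTIONS = 5

# Logits parameterize the two policies: theta_old (frozen sampler)
# and theta (the target policy being evaluated and updated).
theta_old = torch.randn(NUM_ACTIONS)                  # baseline, fixed
theta = torch.randn(NUM_ACTIONS, requires_grad=True)  # target, trainable

def reward_fn(actions):
    # Hypothetical reward: higher-index actions score higher.
    return actions.float()

# 1) Sampling: draw a batch from the frozen baseline policy pi_old.
with torch.no_grad():
    probs_old = torch.softmax(theta_old, dim=-1)
    actions = torch.multinomial(probs_old, num_samples=64, replacement=True)
    logp_old = torch.log(probs_old[actions])
    rewards = reward_fn(actions)

# 2) Reward computation: score the *same* samples under the target policy.
logp_new = torch.log_softmax(theta, dim=-1)[actions]

# Importance weights correct for sampling from pi_old instead of pi_theta:
#   L(theta) = E_{a ~ pi_old}[ (pi_theta(a) / pi_old(a)) * R(a) ]
ratio = torch.exp(logp_new - logp_old)
surrogate = (ratio * rewards).mean()

# The gradient flows only through logp_new, so the sampled batch can be
# reused for several updates before refreshing the baseline policy.
surrogate.backward()
print(surrogate.item(), theta.grad.norm().item())
```

Note how no sampling operation involves `theta`: the target policy only assigns log-probabilities to sequences that already exist, which is exactly why the surrogate objective sidesteps expensive generation from the policy being trained.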
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Equivalence of the Surrogate Objective and the On-Policy Objective
Surrogate Objective at the Policy Reference Point
Equivalence of Surrogate and On-Policy Gradients at the Reference Point
Training a Policy with Off-Distribution Data
A reinforcement learning agent is being updated. The current policy is denoted by $\pi_{\theta}$, and a batch of trajectory data has been collected using a previous, fixed policy, $\pi_{\theta_{\text{old}}}$. To improve the current policy using this existing data, the following objective function is optimized: $\mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_{\theta}(\tau)}{\pi_{\theta_{\text{old}}}(\tau)}\, R(\tau)\right]$. Which statement best analyzes the role of this objective function in the training process?
Rationale for Using a Surrogate Objective
Separation of Sampling and Reward Computation in Policy Learning
Variance in Surrogate Objective Gradient Estimates
Clipped Surrogate Objective Function