Policy Gradient Objective with Importance Sampling
To enhance the stability and reliability of policy gradient methods, importance sampling is frequently employed to refine the estimation of the utility function U(τ; θ). By incorporating a previous or reference policy π_θ_ref, the refined utility for a trajectory τ is calculated as U(τ; θ) = Σ [π_θ(a_t|s_t) / π_θ_ref(a_t|s_t)] * A(s_t, a_t). The ratio term π_θ(a_t|s_t) / π_θ_ref(a_t|s_t) serves as an importance sampling weight that scales the advantage A(s_t, a_t) to account for the difference between the policy currently being optimized and the reference policy under which the trajectory was collected.
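A minimal sketch of this calculation, assuming per-step log-probabilities under the current and reference policies and advantage estimates are already available; the function and variable names below are illustrative, not taken from the text:

```python
import numpy as np

def importance_sampled_utility(logp_current, logp_ref, advantages):
    """Sum over steps of [pi_theta(a_t|s_t) / pi_theta_ref(a_t|s_t)] * A(s_t, a_t)."""
    # Importance sampling ratio, computed in log space for numerical stability.
    ratios = np.exp(np.asarray(logp_current) - np.asarray(logp_ref))
    # Reweight each step's advantage by how much more (or less) likely the
    # current policy is to take the recorded action than the reference policy.
    return float(np.sum(ratios * np.asarray(advantages)))

# Example: three time steps from one trajectory collected under the reference policy.
logp_current = [-0.4, -1.2, -0.7]   # log pi_theta(a_t|s_t) for the policy being optimized
logp_ref     = [-0.9, -1.0, -0.7]   # log pi_theta_ref(a_t|s_t) for the data-collecting policy
advantages   = [ 1.5, -0.3,  0.8]   # A(s_t, a_t) estimates

print(importance_sampled_utility(logp_current, logp_ref, advantages))
```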

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Gradient Objective with Importance Sampling
An agent is being trained using a policy gradient method. After each update to its decision-making process (the policy), the experiences (trajectories) it previously collected are no longer perfectly representative of its new behavior. This mismatch can lead to inaccurate estimates of the value of those past trajectories, causing instability in the training process. Which of the following approaches directly addresses this issue by adjusting the value calculation to account for the change in the policy?
Evaluating Training Strategies for a Robotic Arm
Addressing Data Mismatch in Policy Gradient Training
Learn After
Clipped Utility Function with Upper-Bound Clipping
An agent's policy is being updated using an objective function that relies on importance sampling. Consider a single time step t in a trajectory where the calculated advantage A(s_t, a_t) is large and positive. At the same time, the importance sampling ratio π_θ(a_t|s_t) / π_θ_ref(a_t|s_t) is also large (e.g., 5.0), indicating the current policy is much more likely to choose action a_t than the reference policy was. Given the objective function U(τ; θ) = Σ [π_θ(a_t|s_t) / π_θ_ref(a_t|s_t)] * A(s_t, a_t), what is the most direct consequence of this situation for this specific time step's contribution to the policy update? (A small numeric sketch of this scenario appears after this list.)
Calculating Trajectory Utility with Importance Sampling
In the context of updating a policy using an objective function with importance sampling, if the ratio of the current policy's probability to the reference policy's probability for a given action is greater than 1, this will always increase the likelihood of that action being selected in the subsequent policy update.
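To make the large-ratio scenario in the question above concrete, here is a small numeric sketch; the specific ratios and advantages are assumptions chosen for illustration:

```python
# One step has ratio 5.0 and a large positive advantage, so its term
# dominates the trajectory's importance-sampled utility.
ratio     = [1.0, 5.0, 0.9]    # pi_theta(a_t|s_t) / pi_theta_ref(a_t|s_t) per step
advantage = [0.2, 2.0, 0.1]    # A(s_t, a_t) per step

terms = [r * a for r, a in zip(ratio, advantage)]
print(terms)       # [0.2, 10.0, 0.09] -> the second step contributes ~97% of the total
print(sum(terms))  # 10.29: the update signal is dominated by that single time step
```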