An agent's policy is being updated using an objective function that relies on importance sampling. Consider a single time step t in a trajectory where the calculated advantage A(s_t, a_t) is large and positive. At the same time, the importance sampling ratio π_θ(a_t|s_t) / π_θ_ref(a_t|s_t) is also large (e.g., 5.0), indicating the current policy is much more likely to choose action a_t than the reference policy was.
Given the objective function U(τ; θ) = Σ_t [π_θ(a_t|s_t) / π_θ_ref(a_t|s_t)] * A(s_t, a_t), what is the most direct consequence of this situation for this specific time step's contribution to the policy update?
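A worked sketch of the numbers may help. The snippet below (plain Python; the probabilities, the advantage value, and the helper name per_step_term are illustrative assumptions, not from the source) computes this time step's contribution to U(τ; θ) and contrasts it with a PPO-style upper-bound clip, the topic of the related note below.

```python
# Minimal sketch of the per-step term in
# U(τ; θ) = Σ_t [π_θ(a_t|s_t) / π_θ_ref(a_t|s_t)] * A(s_t, a_t).
# All names and numbers here are illustrative assumptions.

def per_step_term(pi_theta: float, pi_ref: float, advantage: float) -> float:
    """Importance-sampling ratio times advantage for one time step."""
    ratio = pi_theta / pi_ref  # π_θ(a_t|s_t) / π_θ_ref(a_t|s_t)
    return ratio * advantage   # enters the policy update unscaled

# Scenario from the question: ratio = 5.0, large positive advantage.
pi_theta, pi_ref = 0.5, 0.1    # current policy is 5x more likely to pick a_t
advantage = 2.0                # assumed large positive A(s_t, a_t)

print(per_step_term(pi_theta, pi_ref, advantage))  # 10.0: advantage amplified 5x

# For contrast, a PPO-style upper-bound clip (see the related note on
# clipped utility) caps the ratio; epsilon = 0.2 is an assumed value.
eps = 0.2
print(min(pi_theta / pi_ref, 1.0 + eps) * advantage)  # 2.4: update bounded
```

With an unclipped ratio of 5.0, the positive advantage is amplified fivefold, so this single step's contribution to the update is unusually large; upper-bound clipping exists precisely to cap this kind of term.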
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Clipped Utility Function with Upper-Bound Clipping
Calculating Trajectory Utility with Importance Sampling
In the context of updating a policy using an objective function with importance sampling, if the ratio of the current policy's probability to the reference policy's probability for a given action is greater than 1, this will always increase the likelihood of that action being selected in the subsequent policy update.