Multiple Choice

An agent's policy is being updated using an objective function that relies on importance sampling. Consider a single time step t in a trajectory where the calculated advantage A(s_t, a_t) is large and positive. At the same time, the importance sampling ratio π_θ(a_t|s_t) / π_θ_ref(a_t|s_t) is also large (e.g., 5.0), indicating the current policy is much more likely to choose action a_t than the reference policy was.

Given the objective function U(τ; θ) = Σ_t [π_θ(a_t|s_t) / π_θ_ref(a_t|s_t)] · A(s_t, a_t), what is the most direct consequence of this situation for this specific time step's contribution to the policy update?
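To make the scenario concrete, here is a minimal sketch in plain Python of the per-step term in U(τ; θ) under the question's setup. The numeric values (ratio = 5.0, advantage = 2.0) are hypothetical, and the PPO-style clipped comparison with an assumed epsilon is included only to illustrate the standard remedy for this failure mode; none of it is taken from the question itself.

# Per-step importance-sampled term in
# U(tau; theta) = sum_t [pi_theta(a_t|s_t) / pi_theta_ref(a_t|s_t)] * A(s_t, a_t).
# All numbers are hypothetical, chosen to match the question's setup.

ratio = 5.0        # pi_theta(a_t|s_t) / pi_theta_ref(a_t|s_t), large as stated
advantage = 2.0    # A(s_t, a_t), large and positive (assumed value)

# Unclipped contribution: the ratio directly scales the advantage, so this
# single time step contributes a disproportionately large term to the update.
unclipped_term = ratio * advantage
print(f"unclipped contribution: {unclipped_term}")  # 10.0

# PPO-style clipped variant for comparison (epsilon is an assumed
# hyperparameter): the ratio is capped, limiting this step's influence.
epsilon = 0.2
clipped_ratio = max(min(ratio, 1.0 + epsilon), 1.0 - epsilon)
clipped_term = min(unclipped_term, clipped_ratio * advantage)
print(f"clipped contribution:   {clipped_term}")    # 2.4

The sketch shows why a large ratio combined with a large positive advantage produces an outsized, potentially destabilizing gradient contribution for that time step, which is the situation the question asks about.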



Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science