1Cademy - Diagnosing Policy Update Instability

Learn Before

Policy Gradient with Advantage Function Formula

Case Study

Diagnosing Policy Update Instability

An engineer is training a reinforcement learning agent using the policy update rule shown below, where π_θ(a_t|s_t) is the agent's policy and A(s_t, a_t) is the advantage function. Based on the components of this formula, what characteristic of the policy π_θ itself is the most likely cause of the observed instability? Explain your reasoning.

Updated 2025-10-03

Contributors are:

Who are from:

Learn Before

Related