Short Answer

Diagnosing Learning Issues in Policy Gradients

A reinforcement learning agent is trained with a simple policy gradient method to navigate a long maze. A reward of +100 is given only upon reaching the exit; all other steps yield a reward of 0. The agent's performance is not improving. The current implementation updates the policy parameters at the end of each episode by weighting the gradient of the log-policy for every action in the episode by the episode's total cumulative reward. Based on the Policy Gradient Theorem, which weights the gradient of the log-policy at each step by the action-value function Q(s,a), explain why using the total episode reward as a proxy for Q(s,a) is likely causing poor learning performance in this specific scenario.
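A minimal sketch may help when working through this question. The helper functions below are hypothetical (not from any particular library) and contrast the naive weighting described in the prompt, where every action is weighted by the full episode return, with a discounted reward-to-go weighting, a per-step quantity closer in spirit to Q(s,a). With the maze's sparse reward, the naive scheme gives a zero weight to every action of every failed episode, and an identical weight to every action (good or bad) of a successful one.

```python
def total_return_weights(rewards):
    """Naive REINFORCE weight: the full episode return for every step."""
    G = sum(rewards)
    return [G] * len(rewards)

def reward_to_go_weights(rewards, gamma=0.99):
    """Discounted reward-to-go: each action is credited only with the
    rewards that follow it -- a rough per-step proxy for Q(s, a)."""
    weights, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        weights.append(g)
    return weights[::-1]

# Hypothetical maze episodes: +100 only on reaching the exit.
failed = [0.0] * 50                # never reaches the exit
success = [0.0] * 49 + [100.0]     # exit reached on the final step

print(total_return_weights(failed)[:3])   # all zeros: no gradient signal at all
print(total_return_weights(success)[:3])  # every action weighted equally by 100
print(reward_to_go_weights(success)[0])   # early actions get discounted credit
```

The printed weights illustrate the two failure modes the question is probing: failed episodes contribute no gradient, and successful episodes assign the same credit to every action regardless of whether it actually helped reach the exit.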

Updated 2025-10-02

Tags

Data Science

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science