Learn Before
Stabilizing Policy Gradient Training
An engineer is training a reinforcement learning agent and observes that the training process is very unstable, with large fluctuations in performance between updates. The current training algorithm updates the policy for every action in an episode by multiplying the gradient of the log-probability of that action by the total, undiscounted reward for the entire episode. Propose and justify two distinct modifications to the reward-scaling term in this calculation to reduce the observed instability. For each modification, explain the principle that makes it effective.
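A minimal NumPy sketch of the two standard modifications, reward-to-go and a baseline (the episode rewards and the mean-reward baseline below are illustrative assumptions, not values taken from the question):

```python
import numpy as np

# Hypothetical per-step rewards for one episode (illustrative values only).
rewards = np.array([-1.0, -1.0, -1.0, 10.0])

# (a) Unstable scaling: every action is weighted by the same total return,
#     so early actions are credited for rewards they could not have caused.
total_return = rewards.sum()
weights_total = np.full_like(rewards, total_return)

# (b) Reward-to-go: weight each action only by rewards from that step onward,
#     removing variance contributed by rewards earned before the action.
reward_to_go = np.cumsum(rewards[::-1])[::-1]

# (c) Baseline: subtract a state-independent (or state-dependent) value b;
#     this leaves the gradient unbiased while shrinking its variance.
#     The mean reward is a crude illustrative choice of baseline.
baseline = rewards.mean()
weights_advantage = reward_to_go - baseline

print(weights_total)      # [7. 7. 7. 7.]
print(reward_to_go)       # [ 7.  8.  9. 10.]
print(weights_advantage)  # reward-to-go minus the baseline
```

In an actual update, each of these weights would multiply the gradient of the log-probability of the corresponding action; the sketch only shows how the scaling term itself changes under each modification.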
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of Policy Update Mechanisms
An agent completes an episode with the following sequence of rewards: r_1 = -1, r_2 = -1, r_3 = -1, r_4 = +10. When updating the policy for the action taken at time step t = 2, a baseline value of b(s_2) = 5 is used. According to the policy gradient method that incorporates both reward-to-go and a baseline, what is the numerical value of the term that multiplies the gradient of the log-probability of the action at t = 2?
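As a check on the arithmetic (assuming no discounting, matching the undiscounted setting of the parent question), the reward-to-go from t = 2 is r_2 + r_3 + r_4, and the baseline is subtracted from it:

(r_2 + r_3 + r_4) - b(s_2) = (-1 - 1 + 10) - 5 = 8 - 5 = 3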