Essay

Stabilizing Policy Gradient Training

An engineer is training a reinforcement learning agent and observes that the training process is very unstable, with large fluctuations in performance between updates. The current training algorithm updates the policy for every action in an episode by multiplying the gradient of the log-probability of that action by the total, undiscounted reward for the entire episode. Propose and justify two distinct modifications to the reward-scaling term in this calculation to reduce the observed instability. For each modification, explain the principle that makes it effective.
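To make the described update rule concrete, here is a minimal sketch of the algorithm the question refers to (vanilla REINFORCE with a tabular softmax policy). The function name `reinforce_update` and the `(state, action, reward)` episode format are illustrative assumptions, not part of the question; the key point is that every action in the episode is scaled by the same total, undiscounted return `R`, which is the source of the high-variance updates the engineer observes.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over action logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, lr=0.01):
    """One policy update as described in the question.

    Every action's log-probability gradient is multiplied by the
    same total, undiscounted episode return R (no baseline, no
    discounting) -- the reward-scaling term the question asks to modify.

    theta:   (n_states, n_actions) logit table for a softmax policy.
    episode: list of (state, action, reward) tuples.
    """
    R = sum(r for _, _, r in episode)  # total undiscounted return
    grad = np.zeros_like(theta)
    for s, a, _ in episode:
        probs = softmax(theta[s])
        # Gradient of log pi(a|s) w.r.t. the logits: one_hot(a) - probs.
        g = -probs
        g[a] += 1.0
        grad[s] += g * R
    # Gradient ascent on expected return.
    return theta + lr * grad
```

The two standard fixes the question points toward both replace the raw `R` in `grad[s] += g * R`: subtracting a baseline (e.g. the mean return) centers the scaling term without biasing the gradient, and using the reward-to-go from each timestep (optionally discounted) removes rewards that the action could not have caused.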


Updated 2025-10-08

