Learn Before
Stabilizing Policy Gradient Training
An engineer is training a reinforcement learning agent and observes that the training process is very unstable, with large fluctuations in performance between updates. The current training algorithm updates the policy for every action in an episode by multiplying the gradient of the log-probability of that action by the total, undiscounted reward for the entire episode. Propose and justify two distinct modifications to the reward-scaling term in this calculation to reduce the observed instability. For each modification, explain the principle that makes it effective.
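A minimal NumPy sketch of the two standard modifications, reward-to-go and a baseline (the episode rewards and the mean-reward baseline below are illustrative assumptions, not values taken from the question):

```python
import numpy as np

# Hypothetical per-step rewards for one episode (illustrative values only).
rewards = np.array([-1.0, -1.0, -1.0, 10.0])

# (a) Unstable scaling: every action is weighted by the same total return,
#     so early actions are credited for rewards they could not have caused.
total_return = rewards.sum()
weights_total = np.full_like(rewards, total_return)

# (b) Reward-to-go: weight each action only by rewards from that step onward,
#     removing variance contributed by rewards earned before the action.
reward_to_go = np.cumsum(rewards[::-1])[::-1]

# (c) Baseline: subtract a state-independent (or state-dependent) value b;
#     this leaves the gradient unbiased while shrinking its variance.
#     The mean reward is a crude illustrative choice of baseline.
baseline = rewards.mean()
weights_advantage = reward_to_go - baseline

print(weights_total)      # [7. 7. 7. 7.]
print(reward_to_go)       # [ 7.  8.  9. 10.]
print(weights_advantage)  # reward-to-go minus the baseline
```

In an actual update, each of these weights would multiply the gradient of the log-probability of the corresponding action; the sketch only shows how the scaling term itself changes under each modification.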
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of Policy Update Mechanisms
An agent completes an episode with the following sequence of rewards: r_1 = -1, r_2 = -1, r_3 = -1, r_4 = +10. When updating the policy for the action taken at time step t = 2, a baseline value of b(s_2) = 5 is used. According to the policy gradient method that incorporates both reward-to-go and a baseline, what is the numerical value of the term that multiplies the gradient of the log-probability of the action at t = 2?
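As a check on the arithmetic (assuming no discounting, matching the undiscounted setting of the parent question), the reward-to-go from t = 2 is r_2 + r_3 + r_4, and the baseline is subtracted from it:

(r_2 + r_3 + r_4) - b(s_2) = (-1 - 1 + 10) - 5 = 8 - 5 = 3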