Short Answer

Diagnosing Learning Issues in Policy Gradients

A reinforcement learning agent is trained with a simple policy gradient method to navigate a long maze. A reward of +100 is given only upon reaching the exit; all other steps yield a reward of 0. The agent's performance is not improving. The current implementation updates the policy parameters at the end of each episode by weighting the gradient of the log-policy for every action in the episode by the episode's total cumulative reward. Based on the Policy Gradient Theorem, which weights the gradient of the log-policy at each step by the action-value function Q(s,a), explain why using the total episode reward as a proxy for Q(s,a) is likely causing poor learning performance in this specific scenario.
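A minimal sketch may help when working through this question. The helper functions below are hypothetical (not from any particular library) and contrast the naive weighting described in the prompt, where every action is weighted by the full episode return, with a discounted reward-to-go weighting, a per-step quantity closer in spirit to Q(s,a). With the maze's sparse reward, the naive scheme gives a zero weight to every action of every failed episode, and an identical weight to every action (good or bad) of a successful one.

```python
def total_return_weights(rewards):
    """Naive REINFORCE weight: the full episode return for every step."""
    G = sum(rewards)
    return [G] * len(rewards)

def reward_to_go_weights(rewards, gamma=0.99):
    """Discounted reward-to-go: each action is credited only with the
    rewards that follow it -- a rough per-step proxy for Q(s, a)."""
    weights, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        weights.append(g)
    return weights[::-1]

# Hypothetical maze episodes: +100 only on reaching the exit.
failed = [0.0] * 50                # never reaches the exit
success = [0.0] * 49 + [100.0]     # exit reached on the final step

print(total_return_weights(failed)[:3])   # all zeros: no gradient signal at all
print(total_return_weights(success)[:3])  # every action weighted equally by 100
print(reward_to_go_weights(success)[0])   # early actions get discounted credit
```

The printed weights illustrate the two failure modes the question is probing: failed episodes contribute no gradient, and successful episodes assign the same credit to every action regardless of whether it actually helped reach the exit.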

Updated 2025-10-02

Tags

Data Science

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science