A reinforcement learning agent is being trained to navigate a maze. The original reward function provides a large positive reward only upon reaching the exit. To speed up learning, a developer adds a shaping reward function that gives a small, constant positive reward for every single action the agent takes, regardless of the state. After this change, the agent learns a new policy of moving in a perpetual loop instead of solving the maze. Why did adding this specific shaping reward alter the optimal policy?
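A brief sketch of the underlying theory, using the standard potential-based shaping result of Ng, Harada, and Russell (1999); the symbols γ, Φ, c, R_exit, and T below are introduced here only for illustration. A shaping term F is guaranteed to leave the optimal policy unchanged only when it has the potential-based form

F(s, a, s') = \gamma \Phi(s') - \Phi(s),

with the potential Φ fixed to zero at terminal states. A constant per-action bonus F = c > 0 does not fit this form in an episodic task: its discounted sum along a trajectory grows with the number of steps taken, instead of telescoping to a term that depends only on the start and terminal states. The bonus therefore rewards staying in the maze. A policy that loops forever collects

\sum_{t=0}^{\infty} \gamma^t c = \frac{c}{1 - \gamma},

whereas a policy that reaches the exit at step T collects \sum_{t=0}^{T-1} \gamma^t c + \gamma^T R_{exit} and then stops earning the bonus. Whenever c / (1 - \gamma) > R_{exit} (and always in the undiscounted case, where the loop's return is unbounded), the perpetual loop has the higher return, so the shaped reward function defines a different optimal policy.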
Tags
Ch.4 Alignment - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Related
Potential-Based Shaping Function Formula
Analysis of a Flawed Reward Shaping Implementation
Critique of an Arbitrary Shaping Function