In reinforcement learning, there is a mathematical proof that for every specific task, an optimal reward function exists that achieves the fastest convergence. However, identifying such a function is extremely difficult. The challenge of finding this ideal function is known as the Optimal Reward Problem (ORP).

University of Michigan - Ann Arbor

Google

Rewards capture immediate feedback, while value functions capture the long-term return. For instance, an action can yield a low immediate reward yet have a high value in the long term.

Reward vs. Value Function
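
To make the distinction concrete, here is a minimal sketch that compares two hypothetical reward sequences by their discounted return. The sequences, the discount factor of 0.9, and the function name `discounted_return` are illustrative assumptions, not taken from any particular library or task.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma once per step of delay."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical reward sequences that follow two different first actions.
patient = [-1.0, 0.0, 10.0]  # low immediate reward, large payoff later
greedy = [1.0, 0.0, 0.0]     # high immediate reward, nothing afterwards

if __name__ == "__main__":
    print("patient: immediate", patient[0], "return", discounted_return(patient))
    print("greedy:  immediate", greedy[0], "return", discounted_return(greedy))
```

Under these assumed numbers, the patient sequence starts with the lower reward but ends with the higher discounted return, which is the quantity a value function estimates.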

It is important to recognize the common abnormal behaviors caused by improper reward settings. There are three types: reckless behavior, timid behavior, and greedy behavior. Reckless behavior causes the agent to ignore serious side effects, leading to meaningless results, especially in multi-task RL settings. Timid behavior causes the agent to stagnate. Greedy behavior causes many meaningless iterations and makes the agent ignore long-term reward.

Abnormal Behavior Types Due to Improper Reward Setting

1. Define dynamic potentials from the reward function.
2. Inverse Reinforcement Learning.
3. Reward Shaping via Meta-Learning.

Reward Construction Direction without a Prior Estimate

Reward shaping is a technique used to address the challenge of sparse rewards by providing more frequent, intermediate feedback to an agent. As proposed by Andrew Ng and colleagues, potential-based shaping adds to the original reward a term F(s, s') = γΦ(s') − Φ(s), the discounted difference of a potential function Φ that depends only on the state. Because this term telescopes along any trajectory, it guides the agent's learning without changing the optimal policy, and it helps with problems such as meaningless iteration that can arise from delayed rewards.
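
The following is a minimal sketch of potential-based shaping on a hypothetical five-state chain, where the only original reward is 1 for reaching the goal. The environment, the potential (negative distance to the goal, zero at the goal), and all names are assumptions chosen for illustration. The shaping term is the standard discounted difference of potentials between the next state and the current state.

```python
GOAL = 4     # hypothetical goal state on a chain 0..4
GAMMA = 0.9  # assumed discount factor


def potential(state):
    """Assumed potential: negative distance to the goal (zero at the goal)."""
    return -abs(GOAL - state)


def shaping_term(state, next_state, gamma=GAMMA):
    """Potential-based term: gamma * phi(next_state) - phi(state)."""
    return gamma * potential(next_state) - potential(state)


def shaped_reward(reward, state, next_state, gamma=GAMMA):
    """Original reward plus the shaping term."""
    return reward + shaping_term(state, next_state, gamma)


def discounted_return(rewards, gamma=GAMMA):
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def returns_for(path):
    """Return (original, shaped) discounted returns along a state path."""
    steps = list(zip(path[:-1], path[1:]))
    # Sparse original reward: 1 only on the step that enters the goal.
    original = [1.0 if nxt == GOAL else 0.0 for _, nxt in steps]
    shaped = [shaped_reward(r, s, nxt) for r, (s, nxt) in zip(original, steps)]
    return discounted_return(original), discounted_return(shaped)


direct = [0, 1, 2, 3, 4]
detour = [0, 1, 0, 1, 2, 3, 4]  # wanders back before reaching the goal

if __name__ == "__main__":
    print("step toward goal:", shaping_term(0, 1))
    print("step away from goal:", shaping_term(1, 0))
    for name, path in (("direct", direct), ("detour", detour)):
        original, shaped = returns_for(path)
        print(name, "original:", original, "shaped:", shaped)
```

In this sketch a step toward the goal earns a positive shaping term (about +1.3 from state 0) and a step away earns a negative one, so feedback arrives on every step. Both paths start in the same state and end at the goal, so their shaped returns exceed their original returns by the same constant, the negated potential of the start state. The ranking of the two paths is therefore unchanged.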
