Learn Before
Causality Principle in Policy Gradient Calculation
In reinforcement learning, the principle of causality dictates that an action taken at a specific time step can only affect rewards from that point forward, not those already received. As a result, rewards accumulated before time t are considered "fixed" or constant by the time the action a_t is chosen. This implies that the sum of past rewards does not influence the gradient of the policy at time t, a key insight used in deriving policy gradient algorithms.
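Below is a minimal sketch (not from the card itself; NumPy and the helper name reward_to_go are assumptions) of how this principle shows up in a REINFORCE-style estimator: the weight attached to the action at step t is the reverse cumulative sum of rewards, so rewards earned before t never enter it.

```python
import numpy as np

# Minimal sketch of the causality principle in a policy gradient estimator.
# The gradient of log pi(a_t | s_t) is scaled by the "reward-to-go" from
# step t onward; rewards received before t are constants with respect to
# a_t and drop out of the weight entirely.

def reward_to_go(rewards):
    """R_t = r_t + r_{t+1} + ... + r_T for every t (reverse cumulative sum)."""
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 2.0, 3.0])
print(reward_to_go(rewards))  # [6. 5. 3.] -- the weight for a_3 ignores r_1, r_2
```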
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Irrelevance of Past Rewards for Policy Gradient Calculation
An autonomous agent completes a task over four time steps. The sequence of actions and resulting rewards is as follows:
- Time t=1: Action a_1 -> Reward r_1 = 0
- Time t=2: Action a_2 -> Reward r_2 = 0
- Time t=3: Action a_3 -> Reward r_3 = -1
- Time t=4: Action a_4 -> Reward r_4 = +10

When evaluating the decision to take action a_2 at time t=2, which rewards should be considered as being potentially influenced by this specific action?
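As a quick check (a hypothetical sketch consistent with the episode above; the dict layout is an assumption, not from the card), the rewards that a_2 can influence are exactly those from t=2 onward:

```python
# Rewards from the four-step episode above, keyed by time step.
rewards = {1: 0, 2: 0, 3: -1, 4: 10}

t = 2  # evaluating action a_2
influenced = {step: r for step, r in rewards.items() if step >= t}
print(influenced)                # {2: 0, 3: -1, 4: 10} -- r_2, r_3, r_4
print(sum(influenced.values()))  # 9, the reward-to-go that scales a_2's gradient
```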
An agent is learning to play a video game. At time step t=5, the agent performs an action (e.g., jumping). According to the causality principle in this context, this specific action at t=5 can alter the reward that was already received at time step t=3.
Debugging a Policy Update Calculation
Learn After
Sum of Past Rewards Notation
Optimizing Gradient Calculation in a Learning Agent
In the derivation of a policy gradient algorithm, we aim to update a policy based on actions taken within an episode. A core principle states that an action taken at a specific time step, t, can only influence rewards received from that point forward (t' >= t). Given this principle, which of the following mathematical expressions correctly identifies the reward term that should be used to scale the gradient update for the action at time step t?
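As a sketch of the term this question points to (assuming standard notation, with T the final time step of the episode), the causality principle replaces the full return with the reward-to-go, so the gradient for the action at step t is scaled as:

```latex
\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \sum_{t'=t}^{T} r_{t'}
```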
Justification for Policy Gradient Simplification