Learn Before
Debugging a Policy Update Calculation
Based on the causality principle that governs how actions relate to rewards over time, identify the fundamental error in the scoring method described in the case study below and explain why it is incorrect for evaluating the action at t=20.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Irrelevance of Past Rewards for Policy Gradient Calculation
An autonomous agent completes a task over four time steps. The sequence of actions and resulting rewards is as follows:
- Time t=1: Action a_1 -> Reward r_1 = 0
- Time t=2: Action a_2 -> Reward r_2 = 0
- Time t=3: Action a_3 -> Reward r_3 = -1
- Time t=4: Action a_4 -> Reward r_4 = +10
When evaluating the decision to take action a_2 at time t=2, which rewards should be considered as being potentially influenced by this specific action?
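The causality principle can be made concrete with a "reward-to-go" computation: the score credited to the action at step t is the sum of rewards from step t onward, never earlier. Below is a minimal sketch (not from the source; the function name is illustrative) applied to the four-step trajectory above:

```python
def rewards_to_go(rewards):
    """Suffix sums: the return credited to the action at step t
    includes only rewards from step t onward (causality)."""
    rtg = []
    running = 0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

# Trajectory from the example: r_1=0, r_2=0, r_3=-1, r_4=+10
rewards = [0, 0, -1, 10]
print(rewards_to_go(rewards))  # [9, 9, 9, 10]
```

Note that the score for a_2 is r_2 + r_3 + r_4 = 9: the reward r_1, received before a_2 was taken, is excluded because a_2 could not have influenced it.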
Causality Principle in Policy Gradient Calculation
An agent is learning to play a video game. At time step t=5, the agent performs an action (e.g., jumping). According to the causality principle in this context, this specific action at t=5 cannot alter the reward that was already received at time step t=3.