Learn Before
Debugging a Policy Update Calculation
Based on the causality principle that governs how actions relate to rewards over time, identify the fundamental error in the scoring method described in the case study below and explain why it is incorrect for evaluating the action at t=20.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Irrelevance of Past Rewards for Policy Gradient Calculation
An autonomous agent completes a task over four time steps. The sequence of actions and resulting rewards is as follows:
- Time t=1: Action a_1 -> Reward r_1 = 0
- Time t=2: Action a_2 -> Reward r_2 = 0
- Time t=3: Action a_3 -> Reward r_3 = -1
- Time t=4: Action a_4 -> Reward r_4 = +10
When evaluating the decision to take action a_2 at time t=2, which rewards should be considered as being potentially influenced by this specific action?
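The causality principle can be made concrete with a "reward-to-go" computation: the score credited to the action at step t is the sum of rewards from step t onward, never earlier. Below is a minimal sketch (not from the source; the function name is illustrative) applied to the four-step trajectory above:

```python
def rewards_to_go(rewards):
    """Suffix sums: the return credited to the action at step t
    includes only rewards from step t onward (causality)."""
    rtg = []
    running = 0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

# Trajectory from the example: r_1=0, r_2=0, r_3=-1, r_4=+10
rewards = [0, 0, -1, 10]
print(rewards_to_go(rewards))  # [9, 9, 9, 10]
```

Note that the score for a_2 is r_2 + r_3 + r_4 = 9: the reward r_1, received before a_2 was taken, is excluded because a_2 could not have influenced it.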
Causality Principle in Policy Gradient Calculation
An agent is learning to play a video game. At time step t=5, the agent performs an action (e.g., jumping). According to the causality principle in this context, this specific action at t=5 cannot alter the reward that was already received at time step t=3.