Learn Before
An agent completes a task, which consists of a sequence of states, actions, and rewards (s_1, a_1, r_1), (s_2, a_2, r_2), ..., (s_T, a_T, r_T). To improve the agent's performance, we need to adjust the likelihood of taking each action a_t at state s_t. Consider two different ways to calculate the 'quality score' used to update the action a_t:
- Method 1: The score is the sum of all rewards in the sequence: r_1 + r_2 + ... + r_T.
- Method 2: The score is the sum of rewards from time step t onward: r_t + r_{t+1} + ... + r_T.
Which of the following statements best explains why Method 2 is generally a more effective approach for training the agent than Method 1?
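A minimal sketch contrasting the two scoring methods may help. It is not part of the card itself; the function names and the example reward list are illustrative assumptions.

```python
def method1_scores(rewards):
    """Method 1: every action receives the same score, the full-episode return."""
    total = sum(rewards)
    return [total for _ in rewards]

def method2_scores(rewards):
    """Method 2: the action at step t receives the reward-to-go r_t + ... + r_T."""
    scores = []
    running = 0.0
    # Accumulate from the end of the episode so each entry is the
    # sum of rewards from that time step onward.
    for r in reversed(rewards):
        running += r
        scores.append(running)
    scores.reverse()
    return scores

if __name__ == "__main__":
    rewards = [1.0, -2.0, 3.0]       # illustrative episode of T = 3 steps
    print(method1_scores(rewards))   # [2.0, 2.0, 2.0]
    print(method2_scores(rewards))   # [2.0, 1.0, 3.0]
```

Note how Method 1 assigns the same score to every action, while Method 2 credits each action only with the rewards that could still be influenced by it.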
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing Credit Assignment for a Policy Update
An agent completes an episode of 4 time steps, receiving the following sequence of rewards: r_1 = -10, r_2 = +2, r_3 = +5, r_4 = -1. When updating the agent's decision-making process, what is the 'reward-to-go' value that should be associated with the action taken at time step t = 2?
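As a quick check of the arithmetic in this related card, here is a short sketch (variable names are illustrative assumptions) computing the reward-to-go at t = 2:

```python
# Reward-to-go at t = 2 for the episode r_1 = -10, r_2 = +2, r_3 = +5, r_4 = -1.
rewards = [-10, 2, 5, -1]            # list index 0 corresponds to t = 1
t = 2
reward_to_go = sum(rewards[t - 1:])  # r_2 + r_3 + r_4 = 2 + 5 + (-1)
print(reward_to_go)                  # 6
```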