Learn Before
An agent completes a task, which consists of a sequence of states, actions, and rewards (s_1, a_1, r_1), (s_2, a_2, r_2), ..., (s_T, a_T, r_T). To improve the agent's performance, we need to adjust the likelihood of taking each action a_t at state s_t. Consider two different ways to calculate the 'quality score' used to update the action a_t:
- Method 1: The score is the sum of all rewards in the sequence: r_1 + r_2 + ... + r_T.
- Method 2: The score is the sum of rewards from time step t onward: r_t + r_{t+1} + ... + r_T.
Which of the following statements best explains why Method 2 is generally a more effective approach for training the agent than Method 1?
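A minimal sketch contrasting the two scoring methods may help. It is not part of the card itself; the function names and the example reward list are illustrative assumptions.

```python
def method1_scores(rewards):
    """Method 1: every action receives the same score, the full-episode return."""
    total = sum(rewards)
    return [total for _ in rewards]

def method2_scores(rewards):
    """Method 2: the action at step t receives the reward-to-go r_t + ... + r_T."""
    scores = []
    running = 0.0
    # Accumulate from the end of the episode so each entry is the
    # sum of rewards from that time step onward.
    for r in reversed(rewards):
        running += r
        scores.append(running)
    scores.reverse()
    return scores

if __name__ == "__main__":
    rewards = [1.0, -2.0, 3.0]       # illustrative episode of T = 3 steps
    print(method1_scores(rewards))   # [2.0, 2.0, 2.0]
    print(method2_scores(rewards))   # [2.0, 1.0, 3.0]
```

Note how Method 1 assigns the same score to every action, while Method 2 credits each action only with the rewards that could still be influenced by it.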
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing Credit Assignment for a Policy Update
An agent completes an episode of 4 time steps, receiving the following sequence of rewards: r_1 = -10, r_2 = +2, r_3 = +5, r_4 = -1. When updating the agent's decision-making process, what is the 'reward-to-go' value that should be associated with the action taken at time step t = 2?
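As a quick check of the arithmetic in this related card, here is a short sketch (variable names are illustrative assumptions) computing the reward-to-go at t = 2:

```python
# Reward-to-go at t = 2 for the episode r_1 = -10, r_2 = +2, r_3 = +5, r_4 = -1.
rewards = [-10, 2, 5, -1]            # list index 0 corresponds to t = 1
t = 2
reward_to_go = sum(rewards[t - 1:])  # r_2 + r_3 + r_4 = 2 + 5 + (-1)
print(reward_to_go)                  # 6
```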