Learn Before
Multiple Choice

An agent completes a task, producing a trajectory of states, actions, and rewards: (s_1, a_1, r_1), (s_2, a_2, r_2), ..., (s_T, a_T, r_T). To improve the agent's performance, we adjust the likelihood of taking each action a_t at state s_t. Consider two different ways to calculate the 'quality score' used to update action a_t:

  • Method 1: The score is the sum of all rewards in the sequence: r_1 + r_2 + ... + r_T.
  • Method 2: The score is the sum of rewards from time step t onward: r_t + r_{t+1} + ... + r_T.

Which of the following statements best explains why Method 2 is generally a more effective approach for training the agent than Method 1?
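
The contrast between the two scoring rules can be made concrete in code. Below is a minimal sketch, assuming undiscounted scalar rewards stored in a NumPy array; the function names are illustrative and not from any particular library.

```python
import numpy as np

def total_return_scores(rewards: np.ndarray) -> np.ndarray:
    """Method 1: every action a_t is scored with the full-episode return
    r_1 + r_2 + ... + r_T, regardless of when a_t was taken."""
    return np.full(len(rewards), rewards.sum())

def reward_to_go_scores(rewards: np.ndarray) -> np.ndarray:
    """Method 2: action a_t is scored only with the rewards it could have
    influenced, r_t + r_{t+1} + ... + r_T (the 'reward-to-go'),
    computed here as a reversed cumulative sum."""
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, -1.0])
print(total_return_scores(rewards))  # [2. 2. 2. 2.]
print(reward_to_go_scores(rewards))  # [ 2.  1.  1. -1.]
```

Note how under Method 1 every action receives the same score, including credit for rewards earned before the action was taken; Method 2 restricts each action's score to rewards it could causally affect.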



Tags

  • Ch.4 Alignment - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences
  • Analysis in Bloom's Taxonomy
  • Cognitive Psychology
  • Psychology
  • Social Science
  • Empirical Science
  • Science