Multiple Choice

An agent is trained using a policy gradient method where the policy is updated based on the total reward of an entire trajectory. Consider two different trajectories that result in the same total reward:

  • Trajectory A: The agent receives a small, consistent reward of +1 at each of 10 steps, for a total reward of +10.
  • Trajectory B: The agent receives a reward of 0 for the first 9 steps and a large reward of +10 at the final step, for a total reward of +10.

Which of the following statements best analyzes the impact of these reward distributions on the policy update?
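To make the analysis concrete, here is a minimal sketch (hypothetical, not part of the original question) of the per-step weights a REINFORCE-style update would assign under two common formulations. With the plain total-trajectory-reward formulation described in the stem, every log-probability term in both trajectories is scaled by the same +10, so the two reward distributions produce indistinguishable updates; a reward-to-go formulation, by contrast, assigns different per-step credit.

```python
def total_reward_weights(rewards):
    """Plain REINFORCE: every step's log-prob is weighted by the full return."""
    g = sum(rewards)
    return [g] * len(rewards)

def reward_to_go_weights(rewards):
    """Reward-to-go: each step is weighted only by rewards from that step on."""
    weights = []
    remaining = sum(rewards)
    for r in rewards:
        weights.append(remaining)
        remaining -= r
    return weights

traj_a = [1] * 10        # Trajectory A: +1 at each of 10 steps
traj_b = [0] * 9 + [10]  # Trajectory B: +10 only at the final step

# Total-reward weighting cannot distinguish the two trajectories:
print(total_reward_weights(traj_a))  # [10, 10, ..., 10]
print(total_reward_weights(traj_b))  # [10, 10, ..., 10]

# Reward-to-go weighting assigns different per-step credit:
print(reward_to_go_weights(traj_a))  # [10, 9, 8, ..., 1]
print(reward_to_go_weights(traj_b))  # [10, 10, ..., 10]
```

Both function names are illustrative; the point is that scaling every step by the trajectory total erases any information about *when* the reward arrived.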

Updated 2025-09-26

Tags

  • Ch.4 Alignment - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences
  • Analysis in Bloom's Taxonomy
  • Cognitive Psychology
  • Psychology
  • Social Science
  • Empirical Science
  • Science