Learn Before
Multiple Choice

In a reinforcement learning scenario, an agent is in a particular state and has two possible actions, Action A and Action B. The agent's current parameterized policy assigns a non-zero probability to both actions. After sampling several trajectories, the agent estimates that the expected cumulative reward for taking Action A from this state is +10, while the expected cumulative reward for taking Action B from this state is -5. Based on the fundamental principle of updating a policy to maximize expected returns, how will the gradient update affect the probabilities of these actions?
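The expected answer is that the update raises the probability of Action A (positive return) and lowers the probability of Action B (negative return). A minimal sketch of this principle, assuming a hypothetical two-action softmax policy updated with the REINFORCE rule (the question itself names no specific algorithm):

```python
import math

def softmax(logits):
    """Convert logits to action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(logits, action, ret, lr=0.1):
    """One REINFORCE step: move the log-probability of `action` in the
    direction of its estimated return `ret`.
    For a softmax policy, d log pi(a) / d theta_i = 1{i=a} - pi(i)."""
    probs = softmax(logits)
    return [
        theta + lr * ret * ((1.0 if i == action else 0.0) - probs[i])
        for i, theta in enumerate(logits)
    ]

# Hypothetical logits for [Action A, Action B]; both start equally likely.
theta = [0.0, 0.0]
p_before = softmax(theta)

theta = reinforce_update(theta, action=0, ret=10.0)  # Action A, return +10
theta = reinforce_update(theta, action=1, ret=-5.0)  # Action B, return -5

p_after = softmax(theta)
# P(A) rises above its initial value; P(B) falls below its initial value.
```

Because the gradient is scaled by the return, the positive return for A pushes its probability up, while the negative return for B pushes its probability down, keeping the distribution normalized.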


Updated 2025-09-26

Tags

Data Science

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science