Learn Before
Multiple Choice

In a reinforcement learning scenario, an agent is in a particular state and has two possible actions, Action A and Action B. The agent's current parameterized policy assigns a non-zero probability to both actions. After sampling several trajectories, the agent estimates that the expected cumulative reward for taking Action A from this state is +10, while the expected cumulative reward for taking Action B from this state is -5. Based on the fundamental principle of updating a policy to maximize expected returns, how will the gradient update affect the probabilities of these actions?
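The expected answer is that the update raises the probability of Action A (positive return) and lowers the probability of Action B (negative return). A minimal sketch of this principle, assuming a hypothetical two-action softmax policy updated with the REINFORCE rule (the question itself names no specific algorithm):

```python
import math

def softmax(logits):
    """Convert logits to action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(logits, action, ret, lr=0.1):
    """One REINFORCE step: move the log-probability of `action` in the
    direction of its estimated return `ret`.
    For a softmax policy, d log pi(a) / d theta_i = 1{i=a} - pi(i)."""
    probs = softmax(logits)
    return [
        theta + lr * ret * ((1.0 if i == action else 0.0) - probs[i])
        for i, theta in enumerate(logits)
    ]

# Hypothetical logits for [Action A, Action B]; both start equally likely.
theta = [0.0, 0.0]
p_before = softmax(theta)

theta = reinforce_update(theta, action=0, ret=10.0)  # Action A, return +10
theta = reinforce_update(theta, action=1, ret=-5.0)  # Action B, return -5

p_after = softmax(theta)
# P(A) rises above its initial value; P(B) falls below its initial value.
```

Because the gradient is scaled by the return, the positive return for A pushes its probability up, while the negative return for B pushes its probability down, keeping the distribution normalized.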


Updated 2025-09-26

Tags

Data Science

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science