Multiple Choice

In policy gradient methods, the gradient of the performance objective is estimated as an expectation over trajectories. Each trajectory's contribution to this estimate is the product of its cumulative reward and the gradient of its log-probability. Given this structure, why can these methods effectively handle tasks with non-differentiable reward functions, such as a simple binary reward for winning or losing a game?
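The structure described above can be made concrete with a minimal sketch. The example below (an assumed setup, not from the source) trains a softmax policy on a two-armed bandit with a binary win/lose reward using the REINFORCE estimator, grad J(theta) = E[R(tau) * grad log pi_theta(tau)]. The key point: the reward function is a non-differentiable coin flip, but the update only ever differentiates the log-probability of the sampled action.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits of a softmax policy over 2 actions

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(action):
    # Binary, non-differentiable reward: "win" (1.0) or "lose" (0.0).
    # Arm 1 wins 80% of the time, arm 0 only 20% (assumed probabilities).
    p_win = 0.8 if action == 1 else 0.2
    return 1.0 if rng.random() < p_win else 0.0

lr = 0.1
for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)      # sample a trajectory (here: one action)
    r = reward(a)                    # observe non-differentiable reward
    # Gradient of log pi(a) w.r.t. softmax logits: one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi    # REINFORCE update: r scales the score

print(softmax(theta))  # policy shifts toward the higher-win-rate arm
```

No gradient of `reward` is ever taken; the reward only appears as a scalar weight on the score function, which is why a binary game outcome is enough to drive learning.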

Updated 2025-09-28

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science