1Cademy - In policy gradient methods, a baseline `b` is subtracted from the total reward for a trajectory, `R(τ)`, to reduce the variance of the gradient estimate. The update for a trajectory is proportional to `(∇_θ Σ_t log π_θ(a_t|s_t)) * (R(τ) - b)`. Which of the following would be a valid and effective choice for the baseline `b`?

Learn Before

Policy Gradient with Baseline

Multiple Choice

In policy gradient methods, a baseline b is subtracted from the total reward for a trajectory, R(τ), to reduce the variance of the gradient estimate. The update for a trajectory is proportional to (∇_θ Σ_t log π_θ(a_t|s_t)) * (R(τ) - b). Which of the following would be a valid and effective choice for the baseline b?

Updated 2025-10-07

Contributors are:

Who are from:

Learn Before

Related