An agent is being trained using a policy gradient method. A batch of data D is collected, containing exactly two trajectories, τ_1 and τ_2.
- Trajectory τ_1 has a total reward R(τ_1) = 10.
- Trajectory τ_2 has a total reward R(τ_2) = -5.
The gradient of the log-probability for each trajectory with respect to the policy parameters θ is denoted as ∇_θ log Pr_θ(τ_1) and ∇_θ log Pr_θ(τ_2), respectively.
Based on the standard practical estimator for the policy gradient, which of the following expressions correctly represents the estimated gradient ∇J(θ) for this batch?
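The estimator for this two-trajectory batch can be sketched numerically. The vectors `g1` and `g2` below are hypothetical placeholders for ∇_θ log Pr_θ(τ_1) and ∇_θ log Pr_θ(τ_2) (any concrete values would do); only the rewards 10 and -5 come from the question.

```python
import numpy as np

# Hypothetical stand-ins for the log-probability gradients of each trajectory.
g1 = np.array([1.0, 0.0])   # ∇_θ log Pr_θ(τ_1)
g2 = np.array([0.0, 1.0])   # ∇_θ log Pr_θ(τ_2)

rewards = [10.0, -5.0]      # R(τ_1), R(τ_2) from the question
grads = [g1, g2]

# Sample-mean estimator: ∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ)
grad_J = sum(r * g for r, g in zip(rewards, grads)) / len(grads)
print(grad_J)
```

With these placeholder gradients the result is (1/2)(10·g1 − 5·g2) = 5·g1 − 2.5·g2, which is the expression the question is asking for.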
Tags
Ch.4 Alignment - Foundations of Large Language Models
Application in Bloom's Taxonomy
Related
Policy Gradient Estimation from Sampled Trajectories
An agent is being trained using a policy gradient method. The theoretical objective gradient is expressed as an expectation over trajectories τ sampled from the policy π_θ:

∇J(θ) = E_{τ~π_θ}[ (∇_θ log Pr_θ(τ)) R(τ) ]

In practice, this is estimated from a batch of |D| sampled trajectories using the following formula:

∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ)

What key assumption allows for the transition from the theoretical expectation to this practical sample-mean estimator?
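The assumption is that the trajectories in D are drawn i.i.d. from π_θ, which makes the sample mean an unbiased Monte Carlo estimate of the expectation. A toy sketch of this convergence, using a hypothetical three-trajectory distribution with scalar stand-ins for the score function (all values below are illustrative, not from the card):

```python
import random

random.seed(0)

# Toy setup: three possible "trajectories"; π_θ samples them with these probs.
probs   = [0.5, 0.3, 0.2]
scores  = [1.0, -2.0, 0.5]   # scalar stand-ins for ∇_θ log Pr_θ(τ)
rewards = [10.0, -5.0, 3.0]  # R(τ) for each trajectory

# True expectation E_{τ~π_θ}[(∇_θ log Pr_θ(τ)) R(τ)].
true_grad = sum(p * s * r for p, s, r in zip(probs, scores, rewards))

# Monte Carlo estimate: i.i.d. trajectory draws from π_θ, then a sample mean.
N = 200_000
idx = random.choices(range(3), weights=probs, k=N)
estimate = sum(scores[i] * rewards[i] for i in idx) / N

print(true_grad, estimate)  # the sample mean approaches true_grad as N grows
```

Because each draw is an independent sample from π_θ, the law of large numbers guarantees the sample mean converges to the expectation.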
Policy Gradient with Baseline
Reward-to-Go