1Cademy - Policy Gradient Estimation from Sampled Trajectories

Learn Before

Policy Gradient Estimate under Uniform Trajectory Probability

Case Study

Policy Gradient Estimation from Sampled Trajectories

An agent interacts with an environment, producing a dataset D of two trajectories (τ_1, τ_2). For each trajectory, the total reward R(τ) and the gradient of the log-probability of the trajectory (the score function) ∇_θ log Pr_θ(τ) have been computed. Based on the data below, calculate the policy gradient estimate ∇_θ J(θ) that results from assuming each sampled trajectory is equally probable.

Updated 2025-09-29

Contributors are:

Who are from:

Learn Before

Related