Case Study

Policy Gradient Estimation from Sampled Trajectories

An agent interacts with an environment, producing a dataset D of two trajectories (τ_1, τ_2). For each trajectory, the total reward R(τ) and the gradient of the log-probability of the trajectory (the score function) ∇_θ log Pr_θ(τ) have been computed. Recall the Monte Carlo form of the policy gradient: ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} R(τ_i) ∇_θ log Pr_θ(τ_i), where N = |D|. Based on the data below, calculate the policy gradient estimate ∇_θ J(θ) that results from assuming each sampled trajectory is equally probable.
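The averaging step can be sketched in NumPy. The reward and score values below are hypothetical placeholders (the exercise's actual data table is not included in this excerpt); only the estimator logic, weighting each trajectory equally with probability 1/N, is the point.

```python
import numpy as np

# Hypothetical data for two trajectories (NOT the exercise's table):
# rewards[i] stands in for R(tau_i); scores[i] stands in for the
# score function grad_theta log Pr_theta(tau_i), here with a
# 2-dimensional parameter theta for illustration.
rewards = np.array([1.0, -0.5])
scores = np.array([[0.2, -0.1],    # grad log Pr(tau_1)
                   [0.4,  0.3]])   # grad log Pr(tau_2)

# Monte Carlo estimate: the mean over trajectories of
# R(tau_i) * grad_theta log Pr_theta(tau_i).
grad_J = (rewards[:, None] * scores).mean(axis=0)
print(grad_J)  # [ 0.    -0.125]
```

With the placeholder numbers, τ_1 contributes 1.0 · (0.2, −0.1) and τ_2 contributes −0.5 · (0.4, 0.3); averaging the two gives the estimate. Substituting the exercise's actual R(τ_i) and score vectors into `rewards` and `scores` yields the requested ∇_θ J(θ).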

Updated 2025-09-29

Tags

Ch.4 Alignment - Foundations of Large Language Models
