An agent is being trained using a policy gradient method. A batch of data D is collected, containing exactly two trajectories, τ_1 and τ_2.
- Trajectory τ_1 has a total reward R(τ_1) = 10.
- Trajectory τ_2 has a total reward R(τ_2) = -5.
The gradient of the log-probability for each trajectory with respect to the policy parameters θ is denoted as ∇_θ log Pr_θ(τ_1) and ∇_θ log Pr_θ(τ_2), respectively.
Based on the standard practical estimator for the policy gradient, which of the following expressions correctly represents the estimated gradient ∇J(θ) for this batch?
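The estimator for this two-trajectory batch can be sketched numerically. The vectors `g1` and `g2` below are hypothetical placeholders for ∇_θ log Pr_θ(τ_1) and ∇_θ log Pr_θ(τ_2) (any concrete values would do); only the rewards 10 and -5 come from the question.

```python
import numpy as np

# Hypothetical stand-ins for the log-probability gradients of each trajectory.
g1 = np.array([1.0, 0.0])   # ∇_θ log Pr_θ(τ_1)
g2 = np.array([0.0, 1.0])   # ∇_θ log Pr_θ(τ_2)

rewards = [10.0, -5.0]      # R(τ_1), R(τ_2) from the question
grads = [g1, g2]

# Sample-mean estimator: ∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ)
grad_J = sum(r * g for r, g in zip(rewards, grads)) / len(grads)
print(grad_J)
```

With these placeholder gradients the result is (1/2)(10·g1 − 5·g2) = 5·g1 − 2.5·g2, which is the expression the question is asking for.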
Tags
Ch.4 Alignment - Foundations of Large Language Models
Application in Bloom's Taxonomy
Related
Policy Gradient Estimation from Sampled Trajectories
An agent is being trained using a policy gradient method. The theoretical objective gradient is expressed as an expectation over trajectories τ sampled from the policy π_θ:

∇J(θ) = E_{τ~π_θ}[ (∇_θ log Pr_θ(τ)) R(τ) ]

In practice, this is estimated from a batch of |D| sampled trajectories using the following formula:

∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ)

What key assumption allows for the transition from the theoretical expectation to this practical sample-mean estimator?
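The assumption is that the trajectories in D are drawn i.i.d. from π_θ, which makes the sample mean an unbiased Monte Carlo estimate of the expectation. A toy sketch of this convergence, using a hypothetical three-trajectory distribution with scalar stand-ins for the score function (all values below are illustrative, not from the card):

```python
import random

random.seed(0)

# Toy setup: three possible "trajectories"; π_θ samples them with these probs.
probs   = [0.5, 0.3, 0.2]
scores  = [1.0, -2.0, 0.5]   # scalar stand-ins for ∇_θ log Pr_θ(τ)
rewards = [10.0, -5.0, 3.0]  # R(τ) for each trajectory

# True expectation E_{τ~π_θ}[(∇_θ log Pr_θ(τ)) R(τ)].
true_grad = sum(p * s * r for p, s, r in zip(probs, scores, rewards))

# Monte Carlo estimate: i.i.d. trajectory draws from π_θ, then a sample mean.
N = 200_000
idx = random.choices(range(3), weights=probs, k=N)
estimate = sum(scores[i] * rewards[i] for i in idx) / N

print(true_grad, estimate)  # the sample mean approaches true_grad as N grows
```

Because each draw is an independent sample from π_θ, the law of large numbers guarantees the sample mean converges to the expectation.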
Policy Gradient with Baseline
Reward-to-Go