Multiple Choice

An agent is being trained with a policy gradient method. The gradient of the objective J(θ) is expressed as an expectation over trajectories τ sampled from the policy π_θ:

∇J(θ) = E_{τ~π_θ}[ (∇_θ log Pr_θ(τ)) R(τ) ]

In practice, this is estimated from a batch D of |D| sampled trajectories using the sample mean:

∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ)

What key assumption allows the transition from the theoretical expectation to this practical sample-mean estimator?
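To make the estimator concrete, here is a minimal Python sketch (NumPy only). Everything in it is an illustrative assumption rather than part of the question: a stateless softmax policy over three actions and a toy reward that counts how often action 0 is chosen. The sketch draws a batch D of trajectories from the current policy and averages (∇_θ log Pr_θ(τ)) R(τ) over the batch, exactly as in the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a stateless softmax policy over 3 actions with logits theta.
# Pr_theta(tau) factorizes over time steps, so
# grad_theta log Pr_theta(tau) = sum_t grad_theta log pi_theta(a_t).
theta = np.zeros(3)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # For a categorical softmax policy: grad log pi(a) = onehot(a) - pi.
    g = -softmax(theta)
    g[a] += 1.0
    return g

def sample_trajectory(theta, horizon=5):
    # Hypothetical environment: reward 1 each time action 0 is taken.
    actions = rng.choice(3, size=horizon, p=softmax(theta))
    return actions, float(np.sum(actions == 0))

# Draw a batch D from the *current* policy and form the sample mean
#   (1/|D|) sum_{tau in D} (grad log Pr_theta(tau)) R(tau).
D = [sample_trajectory(theta) for _ in range(1000)]
grad_estimate = np.mean(
    [sum(grad_log_pi(theta, a) for a in actions) * R for actions, R in D],
    axis=0,
)
print(grad_estimate)  # pushes the logit of action 0 upward
```

Note the requirement baked into the batch-sampling line: the trajectories in D must be drawn from the current policy π_θ for the sample mean to approximate the expectation.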

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models Course

Analysis in Bloom's Taxonomy