Policy Gradient Estimate under Uniform Trajectory Probability
The general form of the policy gradient is an expectation over trajectories sampled from the policy:

∇_θ J(θ) = E_{τ~π_θ}[ (∇_θ log Pr_θ(τ)) R(τ) ]

By treating every trajectory in a sampled dataset D (with size |D|) as equally probable, we use the practical estimator:

∇_θ J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ)

By decomposing the sequence probability as Pr_θ(τ) = Pr(s_0) Π_t π_θ(a_t | s_t) Pr(s_{t+1} | s_t, a_t), dropping the dynamics gradient (which does not depend on θ), and knowing that the cumulative reward is R(τ) = Σ_{t=0}^{T} r_t, this objective expands to:

∇_θ J(θ) ≈ (1/|D|) Σ_{τ∈D} ( Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) ) ( Σ_{t=0}^{T} r_t )

This formulation highlights that the reward function does not need to be differentiable for optimization: R(τ) enters only as a scalar weight on the log-probability gradients, so no gradient ever flows through it.
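As a minimal sketch of how this estimator is typically realized in code, assuming a PyTorch-style autograd setup (the function name and argument layout below are illustrative, not from the source):

```python
import torch

def policy_gradient_surrogate(log_probs, rewards):
    """Surrogate loss whose gradient matches the practical estimator
    (1/|D|) * sum_{tau in D} (grad_theta log Pr_theta(tau)) * R(tau).

    log_probs: list of 1-D tensors; log_probs[i][t] = log pi_theta(a_t | s_t)
               for step t of trajectory i (computed with autograd enabled).
    rewards:   list of floats; rewards[i] = R(tau_i). R is used only as a
               scalar weight, so it does not need to be differentiable.
    """
    # Sum of per-step log-probabilities = log-probability of the trajectory
    # (the dynamics terms drop out because they do not depend on theta).
    terms = [lp.sum() * r for lp, r in zip(log_probs, rewards)]
    # Negate so that minimizing the surrogate performs gradient ascent on J(theta).
    return -torch.stack(terms).mean()
```

Calling `.backward()` on the returned value accumulates the (negated) sample-mean gradient in the policy parameters, so a standard optimizer step then performs gradient ascent on J(θ).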

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Policy Gradient Theorem
Advantage of Policy Gradients: Non-Differentiable Reward Functions
Decomposition of the Trajectory Log-Probability Gradient
Policy Gradient Objective with Advantage Function
Policy Gradient Estimate under Uniform Trajectory Probability
Score Function in Policy Gradients
During the derivation of the policy performance gradient, a key step transforms the expression Σ_τ [∂Pr_θ(τ)/∂θ] R(τ) into a form that includes the term ∂log Pr_θ(τ)/∂θ. What is the primary analytical purpose of this transformation? (The identity behind this step is restated after the Related list below.)

The following equations represent key steps in deriving the policy gradient. Arrange them in the correct logical order, starting from the initial gradient of the objective function to its final form as an expectation. Note: J(θ) is the objective function, Pr_θ(τ) is the probability of a trajectory τ under policy parameters θ, and R(τ) is the reward for that trajectory.
Analyzing a Flawed Policy Gradient Derivation
Policy Gradient Estimate under Uniform Trajectory Probability
In policy gradient methods, the gradient of the log-probability of a trajectory is initially expressed as the sum of two components: one related to the agent's actions and another related to the environment's transitions. The expression is then simplified by removing the environment's component before optimization. Given the initial expression ∇_θ log Pr_θ(τ) = Σ_t ∇_θ log π_θ(a_t | s_t) + Σ_t ∇_θ log Pr(s_{t+1} | s_t, a_t), what is the fundamental assumption that justifies simplifying this to just the policy component, Σ_t ∇_θ log π_θ(a_t | s_t)?
Applicability of Policy Gradient Methods
Practical Implications of the Policy Gradient Simplification
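The transformation asked about in the first question above rests on the log-derivative identity, restated here in the note's notation as a reminder:

∇_θ Pr_θ(τ) = Pr_θ(τ) ∇_θ log Pr_θ(τ)

so that

Σ_τ (∇_θ Pr_θ(τ)) R(τ) = Σ_τ Pr_θ(τ) (∇_θ log Pr_θ(τ)) R(τ) = E_{τ~Pr_θ}[ (∇_θ log Pr_θ(τ)) R(τ) ],

which rewrites the sum over all trajectories as an expectation that can be estimated by sampling trajectories from the policy.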
Learn After
Policy Gradient Estimation from Sampled Trajectories
An agent is being trained using a policy gradient method. The theoretical objective gradient is expressed as an expectation over trajectories τ sampled from the policy π_θ:

∇J(θ) = E_{τ~π_θ}[ (∇_θ log Pr_θ(τ)) R(τ) ]

In practice, this is estimated from a batch of |D| sampled trajectories using the following formula:

∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ)

What key assumption allows for the transition from the theoretical expectation to this practical sample mean estimator?
Policy Gradient with Baseline
Reward-to-Go
An agent is being trained using a policy gradient method. A batch of data D is collected, containing exactly two trajectories, τ_1 and τ_2.

- Trajectory τ_1 has a total reward R(τ_1) = 10.
- Trajectory τ_2 has a total reward R(τ_2) = -5.

The gradient of the log-probability for each trajectory with respect to the policy parameters θ is denoted as ∇_θ log Pr_θ(τ_1) and ∇_θ log Pr_θ(τ_2), respectively. Based on the standard practical estimator for the policy gradient, which of the following expressions correctly represents the estimated gradient ∇J(θ) for this batch?