Derivation of the Policy Gradient Objective Function
The gradient of the policy performance objective, J(θ), with respect to the policy parameters θ is derived using the log-derivative trick. This mathematical technique transforms the derivative of an expectation into an expectation itself, which can then be estimated from sampled trajectories. The derivation is as follows:

∂J(θ)/∂θ = ∂/∂θ Σ_τ Pr_θ(τ) R(τ) = Σ_τ [∂Pr_θ(τ)/∂θ] R(τ)

By multiplying and dividing by Pr_θ(τ), we can introduce the gradient of the logarithm:

Σ_τ [∂Pr_θ(τ)/∂θ] R(τ) = Σ_τ Pr_θ(τ) [∂Pr_θ(τ)/∂θ] / Pr_θ(τ) · R(τ) = Σ_τ Pr_θ(τ) [∂log Pr_θ(τ)/∂θ] R(τ) = E_{τ∼Pr_θ} [ (∂log Pr_θ(τ)/∂θ) R(τ) ]

This final form shows that the policy gradient is the expected value of the score function, ∂log Pr_θ(τ)/∂θ, weighted by the cumulative reward, R(τ).
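To make the final expectation concrete, below is a minimal numerical sketch, assuming a toy softmax policy over just two fixed trajectories; the reward values, the logits, and helper names such as score_function are illustrative assumptions, not part of the source material. It compares the exact gradient Σ_τ Pr_θ(τ) [∂log Pr_θ(τ)/∂θ] R(τ) with its Monte Carlo estimate obtained by averaging score-function-times-reward terms over sampled trajectories.

import numpy as np

rng = np.random.default_rng(0)

# Toy setting: two possible trajectories with fixed returns.
rewards = np.array([20.0, -10.0])

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def score_function(logits, idx):
    # Gradient of log Pr_theta(trajectory idx) w.r.t. the logits for a
    # softmax policy: one_hot(idx) - probabilities.
    probs = softmax(logits)
    one_hot = np.zeros_like(logits)
    one_hot[idx] = 1.0
    return one_hot - probs

theta = np.array([0.85, 0.0])   # logits; softmax gives roughly a 70/30 split

# Exact gradient: sum over trajectories of Pr_theta(tau) * score * R(tau).
probs = softmax(theta)
exact_grad = sum(probs[i] * score_function(theta, i) * rewards[i]
                 for i in range(len(rewards)))

# Monte Carlo estimate: average of score * R(tau) over sampled trajectories.
n_samples = 100_000
samples = rng.choice(len(rewards), size=n_samples, p=probs)
mc_grad = np.mean([score_function(theta, i) * rewards[i] for i in samples], axis=0)

print("exact gradient:      ", exact_grad)
print("Monte Carlo estimate:", mc_grad)

With enough samples the two printed gradients agree closely, which is exactly the property that lets the policy gradient be estimated from rollouts without differentiating through the reward.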
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Training Objective as Maximization of the Performance Function
Derivation of the Policy Gradient Objective Function
Off-Policy Objective Function with Importance Sampling
An agent is operating under a policy parameterized by θ. This policy can result in one of two possible trajectories. Trajectory A has a total reward of 20 and a 70% probability of occurring. Trajectory B has a total reward of -10 and a 30% probability of occurring. Given that the performance of a policy is measured by the expected cumulative reward over all possible trajectories, J(θ), what is the value of the performance function for this policy? (A worked calculation appears after this list.)
Critique of the Expected Reward Objective
On-Policy Objective Function (Performance Measure)
Policy Performance Comparison
Learn After
Policy Gradient Theorem
Advantage of Policy Gradients: Non-Differentiable Reward Functions
Decomposition of the Trajectory Log-Probability Gradient
Policy Gradient Objective with Advantage Function
Policy Gradient Estimate under Uniform Trajectory Probability
Score Function in Policy Gradients
During the derivation of the policy performance gradient, a key step transforms the expression Σ [∂Pr_θ(τ)/∂θ] R(τ) into a form that includes the term ∂log Pr_θ(τ)/∂θ. What is the primary analytical purpose of this transformation?
The following equations represent key steps in deriving the policy gradient. Arrange them in the correct logical order, starting from the initial gradient of the objective function to its final form as an expectation. Note: J(θ) is the objective function, Pr_θ(τ) is the probability of a trajectory τ under policy parameters θ, and R(τ) is the reward for that trajectory.
Analyzing a Flawed Policy Gradient Derivation
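As a worked check of the policy performance question above (two trajectories with returns 20 and -10 occurring with probabilities 0.7 and 0.3), the performance function is simply the probability-weighted sum of the returns:

J(θ) = Σ_τ Pr_θ(τ) R(τ) = 0.7 × 20 + 0.3 × (−10) = 14 − 3 = 11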