Policy Gradient Optimization
In the context of calculating the gradient for a policy update at a specific time t, the full sum of rewards for an episode (∑_{k=1}^{T} r_k) is often used to weight the score function ∇_θ log π_θ(a_t|s_t). However, a more refined approach replaces the full sum of rewards with only the sum of rewards from time t onwards (∑_{k=t}^{T} r_k). Analyze this modification by explaining both the conceptual principle and the mathematical property that justify ignoring the sum of past rewards (∑_{k=1}^{t-1} r_k) in this calculation.
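A minimal sketch of the mathematical property at play, writing E for the expectation over a_t ∼ π_θ(·|s_t) and c for any factor that is already fixed once the trajectory up to time t is given (for example ∑_{k=1}^{t-1} r_k):

E[ c * ∇_θ log π_θ(a_t|s_t) ]
= c * ∑_a π_θ(a|s_t) * ∇_θ log π_θ(a|s_t)
= c * ∑_a ∇_θ π_θ(a|s_t)
= c * ∇_θ ∑_a π_θ(a|s_t)
= c * ∇_θ 1
= 0

In expectation, the past rewards therefore contribute nothing to the gradient for the action at time t; dropping them leaves the estimate unbiased while reducing its variance.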
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Policy Gradient with Reward-to-Go and Baseline
In a reinforcement learning setting, the update for a policy π_θ(a_t|s_t) depends on a gradient calculation. For a specific timestep t within a sequence of actions, the term influencing the update can be structured as follows, separating rewards that occurred before time t from those that occurred at or after time t:

log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) )

When computing the gradient of this expression with respect to the policy parameters θ, how does the ∑_{k=1}^{t-1} r_k term (the sum of past rewards) influence the gradient associated with the action at time t?
Policy Gradient Optimization
Consider the calculation of the policy gradient with respect to the parameters θ for an action a_t taken at time t. A proposed update rule multiplies the term ∇_θ log π_θ(a_t|s_t) by the sum of all rewards in the trajectory (∑_{k=1}^{T} r_k). Does including the rewards received before time t (∑_{k=1}^{t-1} r_k) in this multiplication introduce a systematic error, or bias, into the resulting gradient estimate?
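A minimal code sketch of the reward-to-go weighting referenced above, assuming plain Python and an illustrative trajectory of per-step rewards (the function name reward_to_go and the sample numbers are hypothetical):

```python
# Illustrative sketch: reward-to-go weights for a REINFORCE-style update.
# For each timestep t, the weight is sum_{k=t}^{T} r_k rather than the
# full-episode return sum_{k=1}^{T} r_k.

def reward_to_go(rewards):
    """Return w with w[t] = sum(rewards[t:]) for every index t."""
    weights = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        weights[t] = running
    return weights

rewards = [1.0, 0.0, 2.0, 1.0]     # hypothetical per-step rewards r_1..r_T
print(sum(rewards))                # full return: 4.0, the same weight at every t
print(reward_to_go(rewards))       # [4.0, 3.0, 3.0, 1.0]

# In the update, ∇_θ log π_θ(a_t|s_t) is multiplied by reward_to_go(rewards)[t]
# instead of the full return, which leaves the expected gradient unchanged
# while lowering the variance of the estimate.
```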