Policy Gradient Theorem

We want to maximize the expected reward with respect to the policy parameters $\theta$, so we use the expected return of the policy $\pi$ starting from a given state $s_0$ as the objective function:

$$J(\theta) = \sum_s \rho^{\pi}(s) \sum_a \pi(s,a)\, R'(s,a)$$

where $R'(s,a) = \sum_{s'\in S} T(s,a,s')\, R(s,a,s')$ and $\rho^{\pi}(s)$ is the discounted state distribution (the probability of being at state $s$ when following policy $\pi$):

$$\rho^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \pi)$$
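As a concrete check of these definitions, the minimal NumPy sketch below evaluates $J(\theta)$ exactly on a small 2-state, 2-action MDP: it builds $R'$, the policy-induced transition matrix, and $\rho^{\pi}$ in closed form. All transition probabilities, rewards, and policy values here are made-up assumptions for illustration, not taken from the text.

```python
import numpy as np

# Toy 2-state, 2-action MDP; all numbers are made up for illustration.
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],    # T[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],    # R[s, a, s'] = reward for that transition
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.6, 0.4],                 # pi[s, a] = probability of action a in state s
               [0.3, 0.7]])
p0 = np.array([1.0, 0.0])                  # episode starts in state 0

# R'(s, a) = sum_{s'} T(s, a, s') R(s, a, s')
R_prime = (T * R).sum(axis=2)

# State-to-state transition matrix induced by the policy.
P = np.einsum("sa,sax->sx", pi, T)

# rho^pi(s) = sum_t gamma^t P(s_t = s | s_0, pi), i.e. rho = p0 (I - gamma P)^{-1}
rho = np.linalg.solve(np.eye(2) - gamma * P.T, p0)

# J(theta) = sum_s rho(s) sum_a pi(s, a) R'(s, a)
J = rho @ (pi * R_prime).sum(axis=1)
print("rho^pi:", rho, " J(theta):", J)
```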

According to the Policy Gradient Theorem,

$$\nabla_{\theta} J(\theta) \propto \sum_s \rho^{\pi}(s) \sum_a \nabla_{\theta} \pi(s,a;\theta)\, Q(s,a;\pi_{\theta})$$

$\nabla_{\theta} \pi(s,a;\theta)$ is the gradient of $\pi$ with respect to $\theta$ at the given $s$ and $a$, and the simplest way to estimate it is with a score-function gradient estimator, because

$$\nabla_{\theta} \pi(s,a;\theta) = \pi(s,a;\theta)\, \frac{\nabla_{\theta}\pi(s,a;\theta)}{\pi(s,a;\theta)} = \pi(s,a;\theta)\, \nabla_{\theta} \log \pi(s,a;\theta)$$

Substituting this into the theorem turns the sums into an expectation,

$$\nabla_{\theta} J(\theta) \propto \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi}\!\left[\nabla_{\theta} \log \pi(s,a;\theta)\, Q(s,a;\pi_{\theta})\right]$$

so the gradient can be estimated from sampled states and actions.
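The sketch below (an illustrative assumption, not from the text) checks the score-function estimator on a single-state softmax policy with a made-up per-action reward vector $f$: the Monte Carlo average of $f(a)\,\nabla_{\theta}\log\pi(a;\theta)$ over sampled actions should match the exact gradient of $\sum_a \pi(a;\theta)\, f(a)$.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4                                   # number of actions (toy example)
theta = rng.normal(size=K)              # softmax policy parameters
f = np.array([1.0, 2.0, 0.5, 3.0])      # made-up per-action reward f(a)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)

# Exact gradient of J(theta) = sum_a pi(a) f(a):
# for a softmax policy, dJ/dtheta_j = pi_j * (f_j - J).
J = pi @ f
exact_grad = pi * (f - J)

# Score-function (likelihood-ratio) estimate:
# grad J = E_{a ~ pi}[ f(a) * grad_theta log pi(a; theta) ],
# where grad_theta log pi(a) = onehot(a) - pi for a softmax.
N = 200_000
actions = rng.choice(K, size=N, p=pi)
grad_samples = f[actions][:, None] * (np.eye(K)[actions] - pi)
mc_grad = grad_samples.mean(axis=0)

print("exact gradient:      ", exact_grad)
print("score-function (MC): ", mc_grad)
```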

$Q(s,a;\pi_{\theta})$ is the action-value function under $\pi$; the simplest way to estimate it is to use the cumulative return computed from entire sampled trajectories.
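Putting the pieces together, here is a minimal REINFORCE-style sketch of this idea on an assumed toy "corridor" MDP, where the discounted return $G_t$ from each visited step stands in for $Q(s_t, a_t; \pi_{\theta})$. The environment, tabular softmax policy, and hyperparameters are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corridor MDP (assumed for illustration): states 0..4, start at state 0.
# Action 0 moves left, action 1 moves right; reward +1 only on reaching state 4.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
GAMMA, LR = 0.99, 0.1

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.zeros((N_STATES, N_ACTIONS))   # tabular softmax policy parameters

for episode in range(500):
    # Roll out one full trajectory under the current policy.
    s, trajectory = 0, []
    for _ in range(50):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break

    # Walk backwards: the return G_t is the Monte Carlo estimate of Q(s_t, a_t).
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + GAMMA * G
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0             # grad_theta log softmax = onehot(a) - pi
        theta[s] += LR * G * grad_log_pi  # ascend the estimated policy gradient

right_probs = [round(float(softmax(theta[s])[1]), 3) for s in range(GOAL)]
print("learned probability of moving right, per state:", right_probs)
```

Because the full-trajectory return is used as the action-value estimate, the gradient estimate is unbiased but noisy, which is why the sketch needs many episodes and a small learning rate.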
