Policy Gradient Theorem

We want to maximize the expected reward with respect to the policy parameters $\theta$, so we use the expected return of the policy $\pi$ starting from a given state $s_0$ as the objective function:

$$J(\theta) = \sum_s \rho^{\pi}(s) \sum_a \pi(s,a)\, R'(s,a)$$

where $R'(s,a) = \sum_{s'\in S} T(s,a,s')\, R(s,a,s')$ and $\rho^{\pi}(s)$ is the discounted state distribution (the probability of being at state $s$ when following policy $\pi$):

$$\rho^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \pi)$$
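As a concrete check of these definitions, the minimal NumPy sketch below evaluates $J(\theta)$ exactly on a small 2-state, 2-action MDP: it builds $R'$, the policy-induced transition matrix, and $\rho^{\pi}$ in closed form. All transition probabilities, rewards, and policy values here are made-up assumptions for illustration, not taken from the text.

```python
import numpy as np

# Toy 2-state, 2-action MDP; all numbers are made up for illustration.
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],    # T[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],    # R[s, a, s'] = reward for that transition
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.6, 0.4],                 # pi[s, a] = probability of action a in state s
               [0.3, 0.7]])
p0 = np.array([1.0, 0.0])                  # episode starts in state 0

# R'(s, a) = sum_{s'} T(s, a, s') R(s, a, s')
R_prime = (T * R).sum(axis=2)

# State-to-state transition matrix induced by the policy.
P = np.einsum("sa,sax->sx", pi, T)

# rho^pi(s) = sum_t gamma^t P(s_t = s | s_0, pi), i.e. rho = p0 (I - gamma P)^{-1}
rho = np.linalg.solve(np.eye(2) - gamma * P.T, p0)

# J(theta) = sum_s rho(s) sum_a pi(s, a) R'(s, a)
J = rho @ (pi * R_prime).sum(axis=1)
print("rho^pi:", rho, " J(theta):", J)
```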

According to the Policy Gradient Theorem,

$$\nabla_{\theta} J(\theta) \propto \sum_s \rho^{\pi}(s) \sum_a \nabla_{\theta} \pi(s,a;\theta)\, Q(s,a;\pi_{\theta})$$

$\nabla_{\theta} \pi(s,a;\theta)$ is the gradient of $\pi$ with respect to $\theta$ at the given $s$ and $a$, and the simplest way to estimate it is with a score-function gradient estimator, because

$$\nabla_{\theta} \pi(s,a;\theta) = \pi(s,a;\theta)\, \frac{\nabla_{\theta}\pi(s,a;\theta)}{\pi(s,a;\theta)} = \pi(s,a;\theta)\, \nabla_{\theta} \log \pi(s,a;\theta)$$

Substituting this into the theorem turns the sums into an expectation,

$$\nabla_{\theta} J(\theta) \propto \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi}\!\left[\nabla_{\theta} \log \pi(s,a;\theta)\, Q(s,a;\pi_{\theta})\right]$$

so the gradient can be estimated from sampled states and actions.
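The sketch below (an illustrative assumption, not from the text) checks the score-function estimator on a single-state softmax policy with a made-up per-action reward vector $f$: the Monte Carlo average of $f(a)\,\nabla_{\theta}\log\pi(a;\theta)$ over sampled actions should match the exact gradient of $\sum_a \pi(a;\theta)\, f(a)$.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4                                   # number of actions (toy example)
theta = rng.normal(size=K)              # softmax policy parameters
f = np.array([1.0, 2.0, 0.5, 3.0])      # made-up per-action reward f(a)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)

# Exact gradient of J(theta) = sum_a pi(a) f(a):
# for a softmax policy, dJ/dtheta_j = pi_j * (f_j - J).
J = pi @ f
exact_grad = pi * (f - J)

# Score-function (likelihood-ratio) estimate:
# grad J = E_{a ~ pi}[ f(a) * grad_theta log pi(a; theta) ],
# where grad_theta log pi(a) = onehot(a) - pi for a softmax.
N = 200_000
actions = rng.choice(K, size=N, p=pi)
grad_samples = f[actions][:, None] * (np.eye(K)[actions] - pi)
mc_grad = grad_samples.mean(axis=0)

print("exact gradient:      ", exact_grad)
print("score-function (MC): ", mc_grad)
```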

$Q(s,a;\pi_{\theta})$ is the action-value function under $\pi$; the simplest way to estimate it is to use the cumulative return computed from entire sampled trajectories.
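Putting the pieces together, here is a minimal REINFORCE-style sketch of this idea on an assumed toy "corridor" MDP, where the discounted return $G_t$ from each visited step stands in for $Q(s_t, a_t; \pi_{\theta})$. The environment, tabular softmax policy, and hyperparameters are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corridor MDP (assumed for illustration): states 0..4, start at state 0.
# Action 0 moves left, action 1 moves right; reward +1 only on reaching state 4.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
GAMMA, LR = 0.99, 0.1

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.zeros((N_STATES, N_ACTIONS))   # tabular softmax policy parameters

for episode in range(500):
    # Roll out one full trajectory under the current policy.
    s, trajectory = 0, []
    for _ in range(50):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break

    # Walk backwards: the return G_t is the Monte Carlo estimate of Q(s_t, a_t).
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + GAMMA * G
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0             # grad_theta log softmax = onehot(a) - pi
        theta[s] += LR * G * grad_log_pi  # ascend the estimated policy gradient

right_probs = [round(float(softmax(theta[s])[1]), 3) for s in range(GOAL)]
print("learned probability of moving right, per state:", right_probs)
```

Because the full-trajectory return is used as the action-value estimate, the gradient estimate is unbiased but noisy, which is why the sketch needs many episodes and a small learning rate.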
