Formula

Policy Gradient Objective with Importance Sampling

To improve the stability and reliability of policy gradient methods, importance sampling is frequently employed to refine the estimate of the utility function $U(\tau; \theta)$. By incorporating a previous or reference policy $\pi_{\theta_{\mathrm{ref}}}$, the refined utility for a trajectory $\tau$ is

$$U(\tau; \theta) = \sum_{t=1}^{T} \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t \mid s_t)} \, A(s_t, a_t).$$

The ratio $\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t \mid s_t)}$ serves as an importance sampling weight: it rescales the advantage $A(s_t, a_t)$ to correct for the difference between the policy being optimized and the reference policy under which the trajectories were collected.
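The following is a minimal PyTorch sketch of this objective, not an implementation from the course. It assumes the per-step log-probabilities under both policies and the advantage estimates are already available as tensors; the function name `importance_sampled_objective` is hypothetical.

```python
import torch

def importance_sampled_objective(logp_current, logp_ref, advantages):
    """Per-trajectory utility U(tau; theta) = sum_t ratio_t * A_t.

    logp_current: log pi_theta(a_t | s_t) for each step t, shape (T,)
    logp_ref:     log pi_theta_ref(a_t | s_t), shape (T,); detached
                  because the reference policy is not being optimized
    advantages:   advantage estimates A(s_t, a_t), shape (T,)
    """
    # Importance weight pi_theta / pi_theta_ref, computed in log space
    # for numerical stability before exponentiating.
    ratio = torch.exp(logp_current - logp_ref.detach())
    # Scale each advantage by its importance weight and sum over the
    # trajectory. This is an objective to maximize, so the training
    # loss would be its negation.
    return (ratio * advantages).sum()

# Usage on dummy data: gradients flow only through logp_current.
T = 8
logp_current = torch.randn(T, requires_grad=True)
logp_ref = torch.randn(T)
advantages = torch.randn(T)
U = importance_sampled_objective(logp_current, logp_ref, advantages)
U.backward()
```

Note that when $\pi_{\theta} = \pi_{\theta_{\mathrm{ref}}}$, every ratio equals 1 and the expression reduces to the ordinary advantage-weighted objective.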


