
Simplification of the Trajectory Log-Probability Gradient

After decomposing the trajectory log-probability gradient, it is standard in reinforcement learning to assume that the environment's dynamics do not depend on the policy parameters $\theta$. Consequently, the gradient of the dynamics term, $\frac{\partial}{\partial \theta} \sum_{t=1}^{T} \log \Pr(s_{t+1} \mid s_t, a_t)$, is zero. The overall gradient therefore reduces to the policy component alone:

$$\frac{\partial \log \mathrm{Pr}_{\theta}(\tau)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t)$$

This simplification lets us concentrate solely on policy updates, without needing to model or differentiate through the underlying environment dynamics.
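To make this concrete, here is a minimal sketch of the simplification, assuming a hypothetical tabular categorical policy whose logits table `theta`, trajectory length `T`, and state/action counts are all illustrative choices not taken from the text. The environment's transition probabilities never appear in the computation: autograd over the policy sum alone already yields the full trajectory log-probability gradient.

```python
import torch

# Minimal sketch: a tabular categorical policy pi_theta(a|s) parameterized
# by a logits table `theta` (toy sizes; all names here are hypothetical).
torch.manual_seed(0)
n_states, n_actions, T = 4, 3, 5
theta = torch.randn(n_states, n_actions, requires_grad=True)

# A toy trajectory of (state, action) pairs. The transitions that produced
# `states` contribute log Pr(s_{t+1}|s_t, a_t) terms to log Pr_theta(tau),
# but those terms have no theta-dependence, so they are simply omitted.
states = torch.randint(0, n_states, (T,))
actions = torch.randint(0, n_actions, (T,))

# Sum of log pi_theta(a_t|s_t) over the trajectory.
log_pi = torch.log_softmax(theta[states], dim=-1)        # shape (T, n_actions)
traj_log_prob = log_pi[torch.arange(T), actions].sum()   # scalar

# d/d theta of the policy sum; by the simplification above, this equals
# the gradient of the full trajectory log-probability.
traj_log_prob.backward()
print(theta.grad)
```

Note that the dynamics are treated purely as a data source here: they determine which `(s_t, a_t)` pairs appear, but no gradient flows through them.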
