Formula

Decomposition of the Trajectory Log-Probability Gradient

By treating sequence generation as a Markov decision process, the gradient of the log-probability of a trajectory $\tau$ with respect to the policy parameters $\theta$ decomposes into two distinct components: a policy term and a dynamics term.

$$
\frac{\partial \log \Pr_{\theta}(\tau)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \underbrace{\log \pi_{\theta}(a_t \mid s_t)}_{\text{policy}} + \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \underbrace{\log \Pr(s_{t+1} \mid s_t, a_t)}_{\text{dynamics}}
$$

The policy term, $\log \pi_{\theta}(a_t \mid s_t)$, is the log-probability of choosing action $a_t$ in state $s_t$ under the policy parameterized by $\theta$. The dynamics term, $\log \Pr(s_{t+1} \mid s_t, a_t)$, captures the environment's transition probability to the next state $s_{t+1}$, given the current state and action. Because the transition probabilities do not depend on $\theta$, the dynamics term has zero gradient, so only the policy term contributes to $\partial \log \Pr_{\theta}(\tau) / \partial \theta$.
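This decomposition can be checked numerically. The sketch below (all names and numbers are illustrative, assuming a hypothetical tabular MDP with a softmax policy and a fixed transition model `P`) compares the finite-difference gradient of the full trajectory log-probability against the gradient of the policy term alone; since the dynamics term is constant in $\theta$, the two agree.

```python
import numpy as np

# Hypothetical tabular MDP: 2 states, 2 actions.
# theta holds softmax logits for pi_theta(a|s); P is the fixed environment model.
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 2))          # logits[state, action]
P = np.array([[[0.7, 0.3], [0.2, 0.8]],  # P[s, a, s'] -- independent of theta
              [[0.5, 0.5], [0.9, 0.1]]])

def log_pi(theta, s, a):
    # log pi_theta(a|s) under a softmax over the logits for state s
    logits = theta[s]
    return logits[a] - np.log(np.exp(logits).sum())

def log_prob_traj(theta, traj):
    # log Pr_theta(tau) = sum_t log pi_theta(a_t|s_t) + sum_t log P(s_{t+1}|s_t,a_t)
    return sum(log_pi(theta, s, a) + np.log(P[s, a, s_next])
               for s, a, s_next in traj)

def policy_term(theta, traj):
    # Only the first (policy) sum of the decomposition
    return sum(log_pi(theta, s, a) for s, a, _ in traj)

def numeric_grad(f, theta, eps=1e-6):
    # Central finite differences over every entry of theta
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        tp, tm = theta.copy(), theta.copy()
        tp[idx] += eps
        tm[idx] -= eps
        g[idx] = (f(tp) - f(tm)) / (2 * eps)
    return g

traj = [(0, 1, 1), (1, 0, 0), (0, 0, 1)]  # (s_t, a_t, s_{t+1}) triples
g_full = numeric_grad(lambda th: log_prob_traj(th, traj), theta)
g_policy = numeric_grad(lambda th: policy_term(th, traj), theta)

# The dynamics term has zero gradient w.r.t. theta, so both gradients match:
print(np.allclose(g_full, g_policy, atol=1e-5))  # True
```

The check makes the point of the decomposition concrete: when optimizing $\theta$ (as in policy-gradient methods), the environment dynamics can be ignored because they contribute nothing to the gradient.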


Updated 2026-05-01


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences