Formula

Decomposition of the Trajectory Log-Probability Gradient

By treating sequence generation as a Markov decision process, the gradient of the log-probability of a trajectory $\tau$ with respect to the policy parameters $\theta$ decomposes into two distinct components: a policy term and a dynamics term.

$$
\frac{\partial \log \Pr_{\theta}(\tau)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \underbrace{\log \pi_{\theta}(a_t \mid s_t)}_{\text{policy}} + \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \underbrace{\log \Pr(s_{t+1} \mid s_t, a_t)}_{\text{dynamics}}
$$

The policy term, $\log \pi_{\theta}(a_t \mid s_t)$, is the log-probability of choosing action $a_t$ in state $s_t$ under the policy parameterized by $\theta$. The dynamics term, $\log \Pr(s_{t+1} \mid s_t, a_t)$, captures the environment's transition probability to the next state $s_{t+1}$, given the current state and action. Because the transition probabilities do not depend on $\theta$, the dynamics term has zero gradient, so only the policy term contributes to $\partial \log \Pr_{\theta}(\tau) / \partial \theta$.
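This decomposition can be checked numerically. The sketch below (all names and numbers are illustrative, assuming a hypothetical tabular MDP with a softmax policy and a fixed transition model `P`) compares the finite-difference gradient of the full trajectory log-probability against the gradient of the policy term alone; since the dynamics term is constant in $\theta$, the two agree.

```python
import numpy as np

# Hypothetical tabular MDP: 2 states, 2 actions.
# theta holds softmax logits for pi_theta(a|s); P is the fixed environment model.
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 2))          # logits[state, action]
P = np.array([[[0.7, 0.3], [0.2, 0.8]],  # P[s, a, s'] -- independent of theta
              [[0.5, 0.5], [0.9, 0.1]]])

def log_pi(theta, s, a):
    # log pi_theta(a|s) under a softmax over the logits for state s
    logits = theta[s]
    return logits[a] - np.log(np.exp(logits).sum())

def log_prob_traj(theta, traj):
    # log Pr_theta(tau) = sum_t log pi_theta(a_t|s_t) + sum_t log P(s_{t+1}|s_t,a_t)
    return sum(log_pi(theta, s, a) + np.log(P[s, a, s_next])
               for s, a, s_next in traj)

def policy_term(theta, traj):
    # Only the first (policy) sum of the decomposition
    return sum(log_pi(theta, s, a) for s, a, _ in traj)

def numeric_grad(f, theta, eps=1e-6):
    # Central finite differences over every entry of theta
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        tp, tm = theta.copy(), theta.copy()
        tp[idx] += eps
        tm[idx] -= eps
        g[idx] = (f(tp) - f(tm)) / (2 * eps)
    return g

traj = [(0, 1, 1), (1, 0, 0), (0, 0, 1)]  # (s_t, a_t, s_{t+1}) triples
g_full = numeric_grad(lambda th: log_prob_traj(th, traj), theta)
g_policy = numeric_grad(lambda th: policy_term(th, traj), theta)

# The dynamics term has zero gradient w.r.t. theta, so both gradients match:
print(np.allclose(g_full, g_policy, atol=1e-5))  # True
```

The check makes the point of the decomposition concrete: when optimizing $\theta$ (as in policy-gradient methods), the environment dynamics can be ignored because they contribute nothing to the gradient.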


Updated 2026-05-01


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences