1Cademy

Learn Before

Decomposition of the Trajectory Log-Probability Gradient

Multiple Choice

In the derivation of a policy-based reinforcement learning algorithm, the gradient of the log-probability of a trajectory τ (a sequence of states and actions) with respect to policy parameters θ is transformed as shown below:

Initial form: ∂/∂θ log [ Π_t (π_θ(a_t|s_t) * P(s_{t+1}|s_t, a_t)) ]

Decomposed form: ∂/∂θ Σ_t log π_θ(a_t|s_t) + ∂/∂θ Σ_t log P(s_{t+1}|s_t, a_t)

By analyzing the components of the decomposed form, what is the most significant implication for the learning algorithm?
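One way to examine the two components of the decomposed form is a small numerical check. The sketch below is illustrative only: it assumes a tabular softmax policy with logits `theta[s, a]`, randomly generated transition tables, and a hand-picked three-step trajectory, none of which come from the question itself. All function names are hypothetical. It estimates the gradient with respect to θ by finite differences, once for the full trajectory log-probability under two different dynamics models and once for the policy term alone.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def policy_log_term(theta, traj):
    # Sum over t of log pi_theta(a_t | s_t), assuming a tabular softmax policy.
    logp = log_softmax(theta)
    return sum(logp[s, a] for s, a, _ in traj)

def dynamics_log_term(P, traj):
    # Sum over t of log P(s_{t+1} | s_t, a_t); theta does not appear here.
    return sum(np.log(P[s, a, s_next]) for s, a, s_next in traj)

def log_traj_prob(theta, P, traj):
    # Log of the product in the initial form, written as the sum of both terms.
    return policy_log_term(theta, traj) + dynamics_log_term(P, traj)

def finite_diff_grad(f, theta, eps=1e-6):
    # Central finite-difference estimate of df/dtheta, one entry at a time.
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        grad[idx] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def random_dynamics(rng, n_states, n_actions):
    # Hypothetical transition table P[s, a, s'] with rows normalized to 1.
    P = rng.random((n_states, n_actions, n_states)) + 0.1
    return P / P.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
theta = rng.normal(size=(n_states, n_actions))
P_a = random_dynamics(rng, n_states, n_actions)
P_b = random_dynamics(rng, n_states, n_actions)

# Illustrative trajectory of (s_t, a_t, s_{t+1}) triples.
traj = [(0, 1, 2), (2, 0, 1), (1, 1, 0)]

grad_full_a = finite_diff_grad(lambda th: log_traj_prob(th, P_a, traj), theta)
grad_full_b = finite_diff_grad(lambda th: log_traj_prob(th, P_b, traj), theta)
grad_policy = finite_diff_grad(lambda th: policy_log_term(th, traj), theta)

print("full gradient matches policy-only gradient:",
      np.allclose(grad_full_a, grad_policy, atol=1e-5))
print("gradient unchanged under different dynamics:",
      np.allclose(grad_full_a, grad_full_b, atol=1e-5))
```

Comparing `grad_full_a`, `grad_full_b`, and `grad_policy` shows how much the second sum in the decomposed form contributes to the gradient with respect to θ in this toy setting.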

Updated 2025-09-26
