Multiple Choice

In the derivation of a policy-based reinforcement learning algorithm, the gradient of the log-probability of a trajectory τ (a sequence of states and actions) with respect to policy parameters θ is transformed as shown below:

Initial form: ∂/∂θ log [ Π_t (π_θ(a_t|s_t) * P(s_{t+1}|s_t, a_t)) ]

Decomposed form: ∂/∂θ Σ_t log π_θ(a_t|s_t) + ∂/∂θ Σ_t log P(s_{t+1}|s_t, a_t)

By analyzing the components of the decomposed form, what is the most significant implication for the learning algorithm?
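
One way to examine the two sums in the decomposed form is a small numerical check. The sketch below is illustrative only: the two-state, two-action problem, the tabular softmax policy, the transition table `P`, and the sample trajectory are all hypothetical choices, not part of the question. It estimates the gradient of each sum with respect to θ by central finite differences, so the contribution of each term can be compared with the gradient of the full log-probability.

```python
import math

# Hypothetical toy problem: 2 states, 2 actions, tabular softmax policy.
# P[s][a][s_next] is a fixed transition table that does not involve theta.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.3, 0.7]],
]

def policy_prob(theta, s, a):
    """pi_theta(a|s) as a softmax over the logits theta[s]."""
    exps = [math.exp(x) for x in theta[s]]
    return exps[a] / sum(exps)

def policy_term(theta, traj):
    """Sum over t of log pi_theta(a_t|s_t)."""
    return sum(math.log(policy_prob(theta, s, a)) for s, a, _ in traj)

def dynamics_term(theta, traj):
    """Sum over t of log P(s_{t+1}|s_t, a_t); theta is accepted but unused."""
    return sum(math.log(P[s][a][s_next]) for s, a, s_next in traj)

def finite_diff_grad(f, theta, traj, eps=1e-6):
    """Central finite-difference gradient of f with respect to each theta[s][a]."""
    grad = []
    for s in range(len(theta)):
        row = []
        for a in range(len(theta[s])):
            plus = [r[:] for r in theta]
            minus = [r[:] for r in theta]
            plus[s][a] += eps
            minus[s][a] -= eps
            row.append((f(plus, traj) - f(minus, traj)) / (2 * eps))
        grad.append(row)
    return grad

# Assumed example values for the policy logits and the sampled trajectory.
theta = [[0.3, -0.2], [0.1, 0.4]]
# Each step is (s_t, a_t, s_{t+1}).
traj = [(0, 1, 1), (1, 0, 0), (0, 0, 0)]

grad_policy = finite_diff_grad(policy_term, theta, traj)
grad_dynamics = finite_diff_grad(dynamics_term, theta, traj)
grad_total = finite_diff_grad(
    lambda th, tr: policy_term(th, tr) + dynamics_term(th, tr), theta, traj
)

print("policy term gradient:  ", grad_policy)
print("dynamics term gradient:", grad_dynamics)
print("full log-prob gradient:", grad_total)
```

Comparing the three printed gradients shows which of the two sums actually drives the parameter update, and therefore which quantities the learner would need access to in order to compute it.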

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science