Multiple Choice

In policy gradient methods, the gradient of the log-probability of a trajectory is initially expressed as the sum of two components: one related to the agent's actions and another related to the environment's transitions. The expression is then simplified by removing the environment's component before optimization. Given the initial expression

$$\frac{\partial}{\partial \theta} \left[ \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) + \sum_{t=1}^{T} \log \text{Pr}(s_{t+1}|s_t, a_t) \right],$$

what is the fundamental assumption that justifies simplifying it to just the policy component, $\frac{\partial}{\partial \theta} \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t)$?
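For context, the two sums in the initial expression come from the standard factorization of the trajectory probability in a Markov decision process; a sketch, assuming an initial-state distribution $\text{Pr}(s_1)$:

$$\text{Pr}_{\theta}(\tau) = \text{Pr}(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t|s_t)\, \text{Pr}(s_{t+1}|s_t, a_t),$$

so that taking the log turns the product into sums:

$$\log \text{Pr}_{\theta}(\tau) = \log \text{Pr}(s_1) + \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) + \sum_{t=1}^{T} \log \text{Pr}(s_{t+1}|s_t, a_t),$$

where the $\log \text{Pr}(s_1)$ term, which does not involve $\theta$, is already omitted from the expression in the question.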


Tags: Ch.4 Alignment - Foundations of Large Language Models; Foundations of Large Language Models Course; Analysis in Bloom's Taxonomy