Short Answer

Evaluating a Policy Gradient Implementation

An AI researcher is implementing a policy gradient algorithm. They correctly identify that the update rule requires calculating the gradient of the log-probability of a trajectory, ∂/∂θ log Pr_θ(τ). In their code, they are attempting to build a complex, differentiable model of the environment's transition probabilities, Pr(s_{t+1}|s_t, a_t), to include its gradient in the final calculation. Based on the mathematical decomposition of the trajectory log-probability gradient, explain the fundamental flaw in the researcher's approach.
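The decomposition the question refers to can be sketched as follows, assuming the standard MDP factorization of a trajectory τ = (s_0, a_0, s_1, a_1, …, s_T):

```latex
\nabla_\theta \log \Pr{}_\theta(\tau)
  = \nabla_\theta \log \Big[ p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, \Pr(s_{t+1} \mid s_t, a_t) \Big]
  = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

The initial-state term \(\log p(s_0)\) and the transition terms \(\log \Pr(s_{t+1} \mid s_t, a_t)\) do not depend on the policy parameters \(\theta\), so their gradients vanish from the sum.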

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science