True/False

Consider the formula for the policy gradient estimate with a baseline:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{|D|} \sum_{\tau \in D} \left( \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \right) \left( \sum_{t=1}^{T} r_t - b \right)$$

According to this formula, the baseline value b is subtracted from the reward r_t at each individual timestep t within a trajectory to reduce variance.
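To see how the two readings of the formula differ, here is a minimal Python sketch (not from the source; the reward values and baseline are made up for illustration). It contrasts subtracting b once from the trajectory's total return, as the formula above writes it, with subtracting b from every individual reward, as the statement describes.

```python
# Minimal sketch with hypothetical values (not from the source):
# contrast subtracting the baseline b once from a trajectory's total
# return with subtracting it from every per-timestep reward.

rewards = [1.0, 0.5, 2.0]  # hypothetical r_1..r_T for one trajectory (T = 3)
b = 0.8                    # hypothetical baseline value

# As the formula is written: b is subtracted once from the total return.
return_minus_b = sum(rewards) - b          # 3.5 - 0.8 = 2.7

# As the statement reads it: b is subtracted at each timestep,
# which is equivalent to subtracting T * b from the total return.
per_step = sum(r - b for r in rewards)     # 3.5 - 3 * 0.8 = 1.1

print(return_minus_b)  # 2.7
print(per_step)        # 1.1 -- a different quantity whenever T > 1
```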


Updated 2025-10-03


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science