Multiple Choice

Consider the following mathematical derivation, which attempts to rewrite the policy gradient with a baseline. The goal is to separate the reward term into components that occurred before and after a specific action. Analyze the steps and identify which one contains a logical or mathematical error.

Derivation: Let the policy gradient objective be:

J(\theta) \propto \sum_{\tau} \left[ \left( \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \right) \left( \sum_{k=1}^{T} r_k - b \right) \right]
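
As a sanity check on what the bracketed expression computes for a single trajectory, here is a minimal sketch. It is a sketch only: the function name pg_estimate, the precomputed grad_log_probs input, and the scalar baseline argument are illustrative assumptions, not anything fixed by the derivation.

```python
import numpy as np

def pg_estimate(grad_log_probs, rewards, baseline=0.0):
    """One-trajectory REINFORCE-with-baseline term (illustrative sketch).

    grad_log_probs: list of arrays; the t-th entry is
        d/dtheta log pi_theta(a_t | s_t) for this trajectory.
    rewards: list of scalars r_1, ..., r_T.
    baseline: scalar b, assumed independent of theta.
    """
    # (sum_k r_k) - b: constant with respect to theta for a fixed trajectory.
    advantage = sum(rewards) - baseline
    # sum_t d/dtheta log pi_theta(a_t | s_t): the score term.
    score = np.sum(np.stack(grad_log_probs), axis=0)
    # Product form matching the bracketed expression above.
    return advantage * score
```

Summing (or averaging) this quantity over sampled trajectories τ yields the estimator that the steps below manipulate.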

Step 1: The reward term, which is constant with respect to the parameters θ, is moved inside the derivative:

\propto \sum_{\tau} \left[ \frac{\partial}{\partial \theta} \left( \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \cdot \left( \sum_{k=1}^{T} r_k - b \right) \right) \right]
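
Step 1 relies only on the fact that a factor constant in θ passes through the derivative. For any differentiable f(θ) and any c independent of θ:

\left( \frac{\partial}{\partial \theta} f(\theta) \right) \cdot c = \frac{\partial}{\partial \theta} \left( f(\theta) \cdot c \right)

with f(\theta) = \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) and c = \sum_{k=1}^{T} r_k - b here.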

Step 2: The reward term, which is constant for a given trajectory τ, is distributed inside the summation over timesteps t:

\propto \sum_{\tau} \left[ \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \left( \log \pi_{\theta}(a_t|s_t) \cdot \left( \sum_{k=1}^{T} r_k - b \right) \right) \right]
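
Step 2 in turn uses only distributivity of a constant factor over a finite sum:

\left( \sum_{t=1}^{T} f_t(\theta) \right) \cdot c = \sum_{t=1}^{T} \left( f_t(\theta) \cdot c \right)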

Step 3: The total reward sum \sum_{k=1}^{T} r_k is decomposed into rewards before and after the current timestep t:

\propto \sum_{\tau} \left[ \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \left( \log \pi_{\theta}(a_t|s_t) \cdot \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t+1}^{T} r_k - b \right) \right) \right]
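
For reference, the standard split of a sum over k = 1, \dots, T around an interior index t is:

\sum_{k=1}^{T} r_k = \sum_{k=1}^{t-1} r_k + r_t + \sum_{k=t+1}^{T} r_k

It is worth comparing this identity term by term against the expression in Step 3.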

Which step introduces an error into the derivation?


Updated 2025-09-28


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
