Short Answer

Policy Gradient Optimization

When computing the policy-gradient update at a specific timestep t, the score function ∇_θ log π_θ(a_t|s_t) is often weighted by the full episode return (∑_{k=1}^{T} r_k). A more refined approach, commonly called the reward-to-go formulation, instead weights it by only the sum of rewards from timestep t onwards (∑_{k=t}^{T} r_k). Analyze this modification by explaining both the conceptual principle and the mathematical property that justify ignoring the sum of past rewards (∑_{k=1}^{t-1} r_k) in this calculation.
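For concreteness, the two weighting schemes in the question can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original question: the function names and the sample reward list are assumptions made for the example.

```python
# Illustrative comparison of the two per-timestep weights applied to the
# score function grad log pi(a_t | s_t) in a policy-gradient estimator.

def full_return_weights(rewards):
    # Naive weighting: every timestep t is weighted by the whole
    # episode return, sum_{k=1}^{T} r_k.
    total = sum(rewards)
    return [total] * len(rewards)

def reward_to_go_weights(rewards):
    # Refined weighting: timestep t is weighted only by rewards earned
    # from t onwards, sum_{k=t}^{T} r_k, computed as a reversed
    # running sum over the episode.
    weights = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        weights.append(running)
    weights.reverse()
    return weights

# Hypothetical episode with T = 4 rewards.
rewards = [1.0, 0.0, 2.0, 3.0]
print(full_return_weights(rewards))   # [6.0, 6.0, 6.0, 6.0]
print(reward_to_go_weights(rewards))  # [6.0, 5.0, 5.0, 3.0]
```

Note that the two schemes agree at t = 1 (where no past rewards exist) and diverge afterwards, since reward-to-go drops the terms ∑_{k=1}^{t-1} r_k that the action a_t cannot influence.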


Updated 2025-10-03


Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science