Concept

Baseline's Role in Centering Rewards and Reducing Gradient Variance

The subtraction of a baseline, $b$, from the total reward, $\sum_{t=1}^{T} r_t$, serves to center the reward values. For instance, if the baseline is defined as the expected total reward, this operation centers the rewards around zero. This centering is the direct mechanism for variance reduction, as it stabilizes the value of the product term, $\left(\sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t)\right)\left(\sum_{t=1}^{T} r_t - b\right)$, which is used to estimate the policy gradient.
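The effect of centering can be checked numerically. The sketch below is a toy illustration, not a reference implementation: it assumes a stateless environment with three actions, a softmax policy with logits `theta`, and a fixed reward per action. All names (`gradient_samples`, `theta`, `rewards`) are hypothetical. It draws the same trajectories twice, once with $b = 0$ and once with $b$ set to the expected total reward, and compares the variance of the resulting gradient estimates.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gradient_samples(theta, rewards, T, n, baseline, rng):
    """Return n single-trajectory gradient estimates, shape (n, K).

    Assumption: the state is ignored, so pi_theta(a_t | s_t) = softmax(theta)[a_t].
    """
    pi = softmax(theta)
    K = len(theta)
    actions = rng.choice(K, size=(n, T), p=pi)       # a_t ~ pi_theta
    total_reward = rewards[actions].sum(axis=1)      # sum_t r_t
    # For a softmax policy, grad_theta log pi(a) = onehot(a) - pi, so the
    # summed score is (per-action counts) - T * pi.
    counts = np.stack([(actions == k).sum(axis=1) for k in range(K)], axis=1)
    score = counts - T * pi
    return score * (total_reward - baseline)[:, None]

theta = np.array([0.0, 0.5, -0.5])
rewards = np.array([10.0, 11.0, 9.0])    # large common offset, small differences
T, n = 5, 20000

pi = softmax(theta)
b = T * (pi @ rewards)                   # expected total reward

# Same seed for both runs, so the sampled trajectories are identical
# and only the baseline differs.
g_raw = gradient_samples(theta, rewards, T, n, 0.0, np.random.default_rng(0))
g_ctr = gradient_samples(theta, rewards, T, n, b, np.random.default_rng(0))

var_raw = g_raw.var(axis=0).sum()
var_ctr = g_ctr.var(axis=0).sum()

# Analytic gradient of the expected total reward for this toy setup.
true_grad = T * pi * (rewards - pi @ rewards)

print("variance without baseline:", var_raw)
print("variance with baseline:   ", var_ctr)
print("mean estimate (centered): ", g_ctr.mean(axis=0))
print("analytic gradient:        ", true_grad)
```

In this setup the rewards share a large common offset, so without a baseline every trajectory's score is multiplied by a total reward of roughly the same large size, and the estimates are dominated by noise. After centering, only the deviation from the expected return multiplies the score, and the averaged estimate should still land close to the analytic gradient.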

Updated 2025-10-06

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences