Comparison

Baseline's Impact on Reward Variance vs. Gradient Estimate Variance

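Introducing a baseline $b$ does not change the variance of the total reward $\sum_{t=1}^{T} r_t$, because subtracting a constant shifts a distribution without changing its spread. The baseline is nevertheless crucial for reducing the variance of the gradient estimates. When $b$ is chosen close to the expected total reward, the shifted quantity $\sum_{t=1}^{T} r_t - b$ is centered around zero, so the factor that scales the log-probability term is typically much smaller in magnitude, and its sign reflects whether a trajectory did better or worse than average. This reduces the variance of the product $\sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \left(\sum_{t=1}^{T} r_t - b\right)$, whose gradient with respect to $\theta$ gives the policy gradient estimate, and so makes the updates more stable.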
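The distinction can be checked numerically. The following is a minimal sketch, not taken from the source: it assumes a toy one-parameter Bernoulli policy with $\pi_{\theta}(a=1) = \sigma(\theta)$, per-step rewards equal to the action plus a constant offset and small noise, and a baseline estimated as the mean return of a separate pilot batch. The function name `simulate` and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

def simulate(n_traj=20000, T=10, theta=0.0, offset=5.0, seed=0):
    """Sample trajectories from a toy Bernoulli policy (illustrative assumption).

    Returns, per trajectory, the score d/dtheta sum_t log pi_theta(a_t)
    and the total reward sum_t r_t.
    """
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-theta))                 # pi_theta(a = 1)
    actions = (rng.random((n_traj, T)) < p).astype(float)
    # Rewards carry a large positive offset, so returns sit far from zero.
    rewards = actions + offset + 0.1 * rng.standard_normal((n_traj, T))
    # For a Bernoulli policy with a sigmoid, d/dtheta log pi(a) = a - p.
    score = (actions - p).sum(axis=1)
    returns = rewards.sum(axis=1)
    return score, returns

score, returns = simulate(seed=0)

# Baseline b: mean return of an independent pilot batch (an assumption;
# any constant that does not depend on the sampled actions keeps the
# estimator unbiased).
_, pilot_returns = simulate(seed=1)
b = pilot_returns.mean()

grad_no_baseline = score * returns
grad_with_baseline = score * (returns - b)

print("Var of returns, without / with baseline:",
      np.var(returns), np.var(returns - b))
print("Var of gradient estimate, without / with baseline:",
      np.var(grad_no_baseline), np.var(grad_with_baseline))
print("Mean of gradient estimate, without / with baseline:",
      grad_no_baseline.mean(), grad_with_baseline.mean())
```

In this setup the variance of the returns is identical in both cases, while the variance of the per-trajectory gradient estimate drops sharply once the baseline is subtracted. The two sample means estimate the same expected gradient, although the one without a baseline is much noisier.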

Updated 2026-05-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences