Concept

Baseline Method for Policy Gradient Variance Reduction

A straightforward technique to lower the variance of the policy gradient is to introduce a baseline, denoted $b$. This baseline acts as a reference point subtracted from the total reward, replacing the return term with $\sum_{t=1}^{T} r_t - b$. Because the expected score function is zero, subtracting a constant (or state-dependent) baseline leaves the gradient estimate unbiased. Centering the rewards around the baseline (e.g., if $b$ is the expected total reward, the centered term has zero mean) shrinks the magnitude of the per-sample gradient terms, making learning updates more stable and less sensitive to extreme fluctuations in individual rewards.
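The effect is easy to verify numerically. The toy setup below is an illustrative sketch (not from the text): a one-dimensional Gaussian policy whose score function is known in closed form, with rewards that carry a large constant offset. Subtracting the mean reward as a baseline leaves the gradient estimate's mean unchanged while sharply reducing its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy policy: actions a ~ N(theta, 1), so the score function
# is d/dtheta log pi(a) = (a - theta).
theta = 0.0
n_samples = 10_000

actions = rng.normal(theta, 1.0, size=n_samples)
rewards = 5.0 + actions  # rewards with a large constant offset

score = actions - theta

# Vanilla policy-gradient samples: score * total reward.
grad_no_baseline = score * rewards

# Baseline b = empirical mean reward (an estimate of E[R]).
# Since E[score] = 0, subtracting b introduces no bias.
b = rewards.mean()
grad_with_baseline = score * (rewards - b)

# Both estimators target the same true gradient; only the variance differs.
print("means:", grad_no_baseline.mean(), grad_with_baseline.mean())
print("variances:", grad_no_baseline.var(), grad_with_baseline.var())
```

In this setup the constant offset of 5 inflates the variance of the naive estimator by roughly an order of magnitude, while both sample means agree: exactly the "centering" effect described above.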


Updated 2026-05-02


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
