Google

While introducing a baseline $$b$$ does not change the overall variance of the total rewards $$\sum_{t=1}^{T} r_t$$, it is crucial for reducing the variance of the gradient estimates. Subtracting the baseline from the total rewards reduces fluctuations around their mean, which makes the gradient estimates more stable. In general, this centers the rewards around zero, leading to reduced variance in the product $$\sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \left(\sum_{t=1}^{T} r_t - b\right)$$.

Baseline's Impact on Reward Variance vs. Gradient Estimate Variance

An engineer training an agent with a policy gradient method notices that the learning process is unstable due to high variance in the gradient estimates. To address this, they introduce a baseline which is subtracted from the rewards. What is the expected statistical consequence of this modification?

An AI developer is using a policy gradient algorithm and decides to add a baseline to stabilize training. They observe that the variance of the policy gradient *estimates* decreases, but the variance of the *total rewards* per episode remains unchanged. Explain the statistical reasoning behind this observation.

Differential Impact of Baselines on Variance

Introducing a baseline into a policy gradient algorithm is an effective technique for reducing the variance of the total rewards collected by the agent, which in turn stabilizes the learning process.

Learn Before

Related