Short Answer

Rationale for Using a Baseline in Policy Gradients

In the context of training a policy, consider a scenario where all rewards for a set of actions are positive (e.g., ranging from +10 to +20). Without a baseline, the reward term weighting the gradient update is always positive, so every sampled action has its probability increased. Explain how subtracting a baseline (e.g., the average reward of +15) from the total reward creates a more stable and effective learning signal, specifically addressing how it reduces the variance of the gradient estimate.
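The scenario lends itself to a quick numerical check. The key identity is E[b * grad log pi(a)] = b * E[grad log pi(a)] = 0, so subtracting a constant baseline b leaves the expected gradient unchanged while shrinking the magnitude of each per-sample estimate (R - b) * grad log pi(a). Below is a minimal sketch using a hypothetical three-action softmax policy and the +10/+15/+20 rewards from the prompt; the setup is illustrative, not part of any reference solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step setup: softmax policy over 3 actions,
# with the all-positive rewards (+10 to +20) from the question.
logits = np.zeros(3)                     # start from a uniform policy
rewards = np.array([10.0, 15.0, 20.0])   # reward for each action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(logits)

# REINFORCE-style per-sample estimates: (R - b) * grad log pi(a).
# For a softmax policy, grad_logits log pi(a) = one_hot(a) - pi.
n = 50_000
actions = rng.choice(3, size=n, p=pi)
grad_log = np.eye(3)[actions] - pi       # shape (n, 3)

g_no_base = rewards[actions][:, None] * grad_log                   # b = 0
g_base = (rewards[actions] - rewards.mean())[:, None] * grad_log   # b = +15

# Same expected gradient either way, but much smaller spread per sample.
print("mean without baseline:   ", g_no_base.mean(axis=0).round(3))
print("mean with baseline (+15):", g_base.mean(axis=0).round(3))
print("var  without baseline:   ", g_no_base.var(axis=0).round(3))
print("var  with baseline (+15):", g_base.var(axis=0).round(3))
```

With the baseline, the weighting term takes values -5, 0, and +5 instead of +10, +15, and +20, so sampled actions are no longer uniformly pushed up: better-than-average actions are reinforced, worse-than-average ones are suppressed, and the component variances of the estimator drop sharply while its mean stays the same.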

Tags

Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science