Concept

Stabilizing Policy Updates with a Divergence Penalty

Incorporating a policy divergence penalty into the optimization objective stabilizes the learning process. The penalty discourages the current policy from straying too far from a reference policy, thereby preventing excessively large updates that could disrupt training.
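As a minimal sketch of this idea, the snippet below penalizes the expected reward by the KL divergence between the current policy and a reference policy over a discrete action set. The function names (`kl_divergence`, `penalized_objective`) and the penalty weight `beta` are illustrative assumptions, not part of the original text.

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) for two discrete probability distributions.
    # Larger values mean p has drifted further from q.
    return float(np.sum(p * np.log(p / q)))

def penalized_objective(expected_reward, policy, ref_policy, beta=0.1):
    # Objective = reward minus a divergence penalty; beta controls
    # how strongly the policy is anchored to the reference.
    return expected_reward - beta * kl_divergence(policy, ref_policy)

# When the current policy equals the reference, the penalty vanishes
# and the objective is just the expected reward.
ref = np.array([0.5, 0.5])
print(penalized_objective(1.0, ref, ref))

# A policy that has drifted from the reference incurs a positive
# penalty, lowering the objective and discouraging the large update.
drifted = np.array([0.9, 0.1])
print(penalized_objective(1.0, drifted, ref))
```

Maximizing this penalized objective trades off reward improvement against staying close to the reference policy; increasing `beta` tightens the anchor.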

Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences