1Cademy - Stabilizing Policy Gradient Training

Learn Before

Clipped Utility Function with Upper-Bound Clipping

Case Study

Stabilizing Policy Gradient Training

Based on the provided case study, explain how implementing a utility function that clips the policy probability ratio with only an upper bound (min(ratio, 1+ε)) would address the observed training instability.

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

Incorporating Policy Divergence Penalty into the Clipped Surrogate Objective

PPO Clipped Objective for Language Models

A reinforcement learning agent is being trained using a utility function that incorporates an upper-bound clip on the policy probability ratio, defined as min(ratio, 1+ε), where ε is a small positive constant. Consider two distinct actions taken during an episode:
- Action A: Has a large positive advantage, and its probability ratio is 2.0.
- Action B: Has a large negative advantage, and its probability ratio is 0.1.
Assuming ε = 0.2, how does this specific clipping mecha

A utility function that modifies the policy probability ratio r_t using the operation min(r_t, 1+ε) is primarily intended to mitigate training instability caused by actions that are discovered to be substantially worse than the reference policy's actions (i.e., actions with a large negative advantage).
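
The upper-bound clip min(ratio, 1+ε) discussed in the items above can be checked numerically. The following is a minimal sketch, not a reference implementation. It assumes the clipped ratio is simply multiplied by the advantage to form a per-action utility. The function names and the advantage magnitudes (+5.0 and -5.0) are hypothetical, since the text only says "large positive" and "large negative". The ratios 2.0 and 0.1 and ε = 0.2 come from the scenario above.

```python
def clip_upper(ratio: float, epsilon: float) -> float:
    """Upper-bound clip only: min(ratio, 1 + epsilon). No lower bound is applied."""
    return min(ratio, 1.0 + epsilon)


def clipped_utility(ratio: float, advantage: float, epsilon: float) -> float:
    """Assumed per-action utility: clipped ratio times advantage (illustrative form)."""
    return clip_upper(ratio, epsilon) * advantage


EPSILON = 0.2

# (probability ratio, advantage); the advantage values are hypothetical placeholders.
actions = {
    "A": (2.0, 5.0),   # large positive advantage, ratio above 1 + epsilon
    "B": (0.1, -5.0),  # large negative advantage, ratio far below 1 + epsilon
}

for name, (ratio, advantage) in actions.items():
    clipped = clip_upper(ratio, EPSILON)
    utility = clipped_utility(ratio, advantage, EPSILON)
    was_clipped = clipped < ratio
    print(f"Action {name}: ratio={ratio}, clipped={clipped}, "
          f"clip active={was_clipped}, utility={utility}")
```

Under these assumptions, the clip is active only for Action A, whose ratio of 2.0 exceeds 1 + ε = 1.2. Action B's ratio of 0.1 is already below the bound, so the upper-bound clip leaves it unchanged.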