1Cademy - A utility function that modifies the policy probability ratio `r_t` using the operation `min(r_t, 1+ε)` is primarily intended to mitigate training instability caused by actions that are discovered to be substantially worse than the reference policy's actions (i.e., actions with a large negative advantage).

Action A: Has a large positive advantage, and its probability ratio is 2.0.
Action B: Has a large negative advantage, and its probability ratio is 0.1.
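
To make the operation concrete, the following minimal sketch applies `min(r_t, 1+ε)` to the probability ratios of Action A and Action B. The helper name `upper_clipped_ratio` and the value ε = 0.2 are assumptions chosen for illustration; neither comes from the original statement.

```python
def upper_clipped_ratio(ratio: float, epsilon: float) -> float:
    """Apply the upper-bound clip min(r_t, 1 + epsilon) to a probability ratio."""
    return min(ratio, 1.0 + epsilon)


EPSILON = 0.2  # assumed clipping parameter (hypothetical, not given in the source)

# Probability ratios taken from the two actions described above.
ratio_a = upper_clipped_ratio(2.0, EPSILON)  # Action A: large positive advantage
ratio_b = upper_clipped_ratio(0.1, EPSILON)  # Action B: large negative advantage

print("Action A clipped ratio:", ratio_a)
print("Action B clipped ratio:", ratio_b)
```

With these inputs, only Action A's ratio is changed (it is capped at 1 + ε). Action B's ratio of 0.1 lies below the cap and passes through unchanged, which is worth keeping in mind when judging which kind of action this operation targets.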

Learn Before

Clipped Utility Function with Upper-Bound Clipping

True/False

A utility function that modifies the policy probability ratio `r_t` using the operation `min(r_t, 1+ε)` is primarily intended to mitigate training instability caused by actions that are discovered to be substantially worse than the reference policy's actions (i.e., actions with a large negative advantage).

Updated 2025-10-03
