1Cademy - An engineer is training a text-generation model using a reinforcement learning algorithm. They notice that the models performance is highly unstable: after a few successful updates, a single large update often causes the models output quality to degrade significantly. Which of the following mechanisms is specifically designed to prevent such large, destabilizing policy updates by limiting the magnitude of the change between the new and old policies at each step?

Learn Before

Proximal Policy Optimization (PPO)

Multiple Choice

An engineer is training a text-generation model using a reinforcement learning algorithm. They notice that the model's performance is highly unstable: after a few successful updates, a single large update often causes the model's output quality to degrade significantly. Which of the following mechanisms is specifically designed to prevent such large, destabilizing policy updates by limiting the magnitude of the change between the new and old policies at each step?

Updated 2025-09-26

Contributors are:

Who are from:

Learn Before

Related