Consequences of Modifying the PPO Objective Function
A researcher is training a language model using an objective that combines a clipped surrogate term (which encourages high reward) with a policy-divergence penalty controlled by a coefficient β. If the researcher sets β to zero for the entire training run, what are the likely consequences for the model's generated text? Describe both a potential short-term benefit and a significant long-term drawback.
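For context (not part of the original card): the objective described here is the standard PPO-with-KL-penalty form used in RLHF, roughly

    L(θ) = E_t[ min( r_t(θ)·Â_t , clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ] − β · KL(π_θ ‖ π_ref),

where r_t(θ) is the probability ratio between the current policy and the rollout policy, Â_t is the advantage estimate, and π_ref is the frozen reference model. Setting β = 0 deletes the second term, so nothing anchors the policy to the reference distribution. A minimal PyTorch sketch of this combined loss follows; all names (ppo_objective, logp_new, logp_ref, and so on) are illustrative, not taken from any particular library:

```python
import torch

def ppo_objective(logp_new, logp_old, logp_ref, advantages,
                  beta=0.1, clip_eps=0.2):
    """Clipped surrogate objective with a policy-divergence penalty.

    With beta=0 the penalty term vanishes: the policy is free to chase
    reward, which can raise scores quickly (a short-term benefit) but
    also lets it drift arbitrarily far from the reference model
    (long-term risk of degenerate, reward-hacked text).
    """
    # Probability ratio between the current and the rollout policy.
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate: take the pessimistic (elementwise min) estimate.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()

    # Crude per-token KL estimate against the frozen reference policy.
    kl_penalty = (logp_new - logp_ref).mean()

    # Objective to maximize; in practice one minimizes its negative.
    return surrogate - beta * kl_penalty
```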
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model is being trained using an objective function that balances a reward-based component with a penalty for deviating from an initial reference policy. The penalty's influence is controlled by a coefficient, β. During training, developers observe that the model's outputs, while achieving high reward scores, are becoming increasingly repetitive and nonsensical. Which of the following adjustments to β is the most appropriate first step to mitigate this issue, and why?
Impact of Penalty Coefficient on LLM Fine-Tuning