PPO Objective Formula for LLM Training in RLHF
The policy in RLHF is updated by minimizing the Proximal Policy Optimization (PPO) loss. This objective function combines a clipped surrogate objective, which uses the advantage function $\hat{A}_t$, with a penalty term to prevent large deviations from the reference policy ($\pi_{\text{ref}}$). The formula is expressed as:

$$
L^{\mathrm{PPO}}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}\left[ \sum_{t} \min\!\Big( \rho_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \;-\; \beta \,\log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})} \right]
$$

where $\rho_t(\theta) = \pi_\theta(y_t \mid x, y_{<t}) / \pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})$ is the per-token probability ratio and $\epsilon$ sets the clipping range. This loss is minimized over all prompts $x$ in the dataset and for each token $y_t$ in the generated sequence $y$. The term scaled by $\beta$ acts as a KL-divergence penalty to ensure training stability.
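As a rough sketch of how this loss can be computed for a single generated sequence, assuming per-token log-probabilities from the current, old (sampling), and reference policies and per-token advantage estimates are already available (the function and tensor names below are illustrative, not from the original text):

```python
import torch

def ppo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.1):
    """Per-token PPO loss with a KL-style penalty toward a reference policy.

    All arguments are 1-D tensors over the tokens of one generated sequence:
      logp_new   - log pi_theta(y_t | x, y_<t), the policy being trained
      logp_old   - log pi_theta_old(y_t | x, y_<t), the policy that sampled y
      logp_ref   - log pi_ref(y_t | x, y_<t), the frozen reference policy
      advantages - advantage estimates A_t for each token
    """
    ratio = torch.exp(logp_new - logp_old)                      # rho_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = torch.min(unclipped, clipped)                   # clipped surrogate objective
    kl_penalty = beta * (logp_new - logp_ref)                   # per-token divergence penalty
    return -(surrogate - kl_penalty).sum()                      # loss to be minimized
```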

Related
Parameter Update at the Reference Policy Point in PPO
Diagnosing Issues in LLM Reinforcement Learning
In the context of fine-tuning a language model with reinforcement learning, the optimization objective often includes a penalty term that measures the divergence from an initial reference policy. What is the most critical trade-off this penalty term is designed to manage?
In the context of fine-tuning a language model with reinforcement learning, the optimization objective is composed of several key elements. Match each element with its primary function in the training process.
Overall PPO Objective Function for Language Models
During the training of a language model, the policy is updated based on a clipped objective function. Consider a single token generation step where the ratio of the current policy's probability to the reference policy's probability for a specific token is very large (e.g., 3.0), and the estimated advantage for generating this token is highly positive. The clipping range is set to [0.8, 1.2]. How does the clipping mechanism influence the calculation of the objective for this specific token?
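As a worked illustration of the arithmetic in this scenario (using the clipped-surrogate form from the formula above and the numbers given in the question):

$$
\mathrm{clip}(3.0,\, 0.8,\, 1.2) = 1.2
\qquad\Rightarrow\qquad
\min\big(3.0 \cdot \hat{A}_t,\; 1.2 \cdot \hat{A}_t\big) = 1.2 \cdot \hat{A}_t \quad \text{since } \hat{A}_t > 0
$$

The token's contribution is therefore capped at $1.2\,\hat{A}_t$, and because the clipped branch is selected, no gradient flows through the probability ratio for this token.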
Policy Update Analysis with Negative Advantage
Analysis of Clipping Mechanism based on Advantage Sign
An autoregressive language model is generating the two-token response 'Good day' given a prompt. The table below shows the per-token log-probabilities from the current policy being trained ($\pi_\theta$) and a fixed reference policy ($\pi_{\text{ref}}$). The policy divergence penalty is calculated as the sum of the differences between the log-probabilities of the current and reference policies for each token.

| Token | $\log \pi_\theta$ | $\log \pi_{\text{ref}}$ |
| :--- | :---: | :---: |
| 'Good' | -0.8 | -1.5 |
| 'day' | -0.4 | -2.1 |
Based on this data, what can be concluded about the current policy's behavior for this specific generation?
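As a worked computation of the penalty described above (assuming, as the order in the prose suggests, that the first value column belongs to the current policy $\pi_\theta$ and the second to the reference policy $\pi_{\text{ref}}$):

$$
\sum_{t} \big(\log \pi_\theta(y_t) - \log \pi_{\text{ref}}(y_t)\big) = \big(-0.8 - (-1.5)\big) + \big(-0.4 - (-2.1)\big) = 0.7 + 1.7 = 2.4
$$

A positive sum of this kind indicates that the current policy assigns both tokens noticeably higher probability than the reference policy does.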
Diagnosing Training Issues with Policy Divergence
Interpreting the Policy Divergence Penalty
PPO Clipped Surrogate Objective in RLHF
Advantage Function Estimation in RLHF
Diagnosing Training Instability in Language Model Fine-Tuning
A team is fine-tuning a language model using a reinforcement learning process. In each step, the model generates a response to a prompt, a separate reward model scores the response, and the language model's parameters are updated based on this score. The team finds that a simple update rule, which aggressively maximizes the immediate reward, often leads to 'policy collapse'—the model's linguistic quality degrades, and it starts generating repetitive, nonsensical text that happens to exploit the reward model. What is the primary reason for employing an algorithm like Proximal Policy Optimization (PPO) in this scenario?
When fine-tuning a language model with a reward signal, an optimization method like Proximal Policy Optimization (PPO) is used. A correct implementation of this method would prioritize maximizing the reward score above all else, allowing for significant and unconstrained changes to the model's policy in each training step to quickly find high-reward outputs.
Value Function Loss Minimization in RLHF
During the final training phase of a language model guided by human feedback, both a policy (the language model itself) and a value function are updated in tandem. Which of the following statements best analyzes the distinct roles and update mechanisms of these two components in this joint optimization process?
In the final stage of training a language model with feedback, a policy and a value function are optimized concurrently. Match each component to its primary optimization objective and its role in this process.
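For context on the two objectives involved, here is a minimal sketch of the value-function side of this joint update, assuming the value model is regressed toward the returns observed for the sampled responses (the function and tensor names are illustrative, not from the original text):

```python
import torch

def value_loss(values: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Mean squared error between the value model's per-token predictions
    and the returns derived from the reward model's scores."""
    return torch.mean((values - returns) ** 2)
```

The policy, by contrast, is updated with the clipped PPO objective shown at the top of this page.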
Value Model Update Frequency in RLHF
Advantage Function as TD Error in RLHF
Diagnosing Training Stagnation in Joint Optimization
Analyzing a Single Training Step in Language Model Fine-Tuning
Calculating the Advantage for a Single Token Generation
During the fine-tuning of a large language model, at a specific generation step $t$, the calculated advantage value is found to be significantly negative ($\hat{A}_t \ll 0$). What is the most accurate interpretation of this outcome?
Learn After
Diagnosing LLM Training Instability
A team is fine-tuning a large language model using a reinforcement learning objective that includes a clipped probability ratio multiplied by an advantage estimate, and a penalty term based on the divergence from a reference model. During training, they observe that while the model's average reward is increasing, its outputs are becoming nonsensical and repetitive, losing the general language capabilities of the original model. Which of the following is the most likely cause of this issue?
A language model is being trained using a reinforcement learning objective. For each generated token, part of this objective is calculated as:
`Clip(probability_ratio) * Advantage`. The `probability_ratio` is the likelihood of generating the token under the new policy divided by the likelihood under the old policy, and `Advantage` is an estimate of how much better that token was than the expected average. In a particular training step for a token y, the `Advantage` is strongly positive, and the `probability_ratio` is already high (e.g., 1.5, where the clipping threshold is 1.2). How does the `Clip` function influence the update to the model's policy for generating token y?