Formula

PPO Objective Formula for LLM Training in RLHF

The policy in RLHF is updated by minimizing the Proximal Policy Optimization (PPO) loss. This objective function combines a clipped surrogate term, which uses the advantage function $A_t$, with a penalty term that prevents large deviations from the reference policy $\mathrm{Pr}_{\theta_{\mathrm{old}}}$. The formula is expressed as:

$$\min_{\theta} \; -\sum_{x \in D,\; y \sim \mathrm{Pr}_{\theta_{\mathrm{old}}}(\cdot \mid x)} \sum_{t=1}^{T} \left[ \mathrm{Clip}\!\left( \frac{\mathrm{Pr}_{\theta}(y_t \mid x, y_{<t})}{\mathrm{Pr}_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})} \right) A_t \;-\; \beta \left( \log \mathrm{Pr}_{\theta}(y_t \mid x, y_{<t}) - \log \mathrm{Pr}_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t}) \right) \right]$$

This loss is minimized over all prompts $x$ in the dataset $D$ and over each token position $t$ in the generated sequence $y$. The term scaled by $\beta$ acts as a KL-divergence penalty to ensure training stability.
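For concreteness, the sketch below shows how the per-token loss could be computed in PyTorch. It assumes $\mathrm{Clip}$ denotes the standard PPO clipped surrogate, i.e. $\min\!\left(r_t A_t,\ \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)\, A_t\right)$; the function name `ppo_rlhf_loss`, the clip range `clip_eps`, and the tensor shapes are illustrative assumptions, not part of the formula above.

```python
import torch

def ppo_rlhf_loss(logp_new, logp_old, advantages, mask, clip_eps=0.2, beta=0.01):
    """Per-token PPO loss for RLHF, following the formula above (sketch).

    logp_new:   log Pr_theta(y_t | x, y_<t),     shape (batch, T), requires grad
    logp_old:   log Pr_theta_old(y_t | x, y_<t), shape (batch, T), detached
    advantages: A_t for each generated token,    shape (batch, T)
    mask:       1.0 for response tokens, 0.0 for padding, shape (batch, T)
    """
    # Probability ratio Pr_theta / Pr_theta_old, computed in log space for stability.
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate: assumes Clip(.) means the standard PPO form
    # min(ratio * A_t, clamp(ratio, 1 - eps, 1 + eps) * A_t).
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped_ratio * advantages)

    # Penalty term beta * (log Pr_theta - log Pr_theta_old).
    kl_penalty = beta * (logp_new - logp_old)

    # The bracketed term is maximized, so the loss is its negation,
    # averaged over non-padding tokens.
    per_token = surrogate - kl_penalty
    return -(per_token * mask).sum() / mask.sum()
```

In practice the advantages $A_t$ come from a value model (e.g. via GAE), and `logp_old` is recorded when the responses are sampled from $\mathrm{Pr}_{\theta_{\mathrm{old}}}$ and kept fixed during the update.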

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
