Case Study

Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses

You are fine-tuning a customer-support LLM using RLHF. Humans provide pairwise preferences between two candidate answers per prompt, and you train a reward model to score answers so that preferred answers get higher scores (i.e., reward model training is a ranking problem). You then optimize the LLM policy with PPO using a policy-gradient-style objective that weights log-probability changes by an advantage estimate, and you include a KL-divergence penalty to keep the policy close to a frozen reference model (the pre-RLHF model).
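
For concreteness, here is a minimal PyTorch-style sketch of the two objectives just described: a Bradley-Terry pairwise ranking loss for the reward model, and a clipped PPO surrogate with a KL penalty toward the frozen reference. The function names (`pairwise_reward_loss`, `ppo_step_loss`) and hyperparameters (`clip_range`, `kl_coef`) are illustrative assumptions, not part of the case-study system; many implementations instead fold the KL penalty into the reward rather than adding it to the loss.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_preferred, score_rejected):
    """Bradley-Terry style ranking loss. The reward model only learns that
    preferred answers should score higher than rejected ones, so its output
    is a relative preference signal, not a calibrated measure of quality."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

def ppo_step_loss(logprobs_new, logprobs_old, logprobs_ref,
                  advantages, clip_range=0.2, kl_coef=0.1):
    """Clipped PPO surrogate plus a KL penalty toward the frozen reference.
    All log-prob tensors are per-token log-probabilities of the sampled
    response under the respective policies."""
    ratio = torch.exp(logprobs_new - logprobs_old)        # new-policy / old-policy ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()   # clipped surrogate objective
    kl_to_ref = (logprobs_new - logprobs_ref).mean()      # crude per-token KL estimate vs. reference
    return policy_loss + kl_coef * kl_to_ref
```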

After a new PPO training run, offline evaluation shows the average reward-model score increased by 18%, but production monitoring shows two regressions: (1) the model's responses become noticeably more verbose and salesy in tone compared to the reference, and (2) refusal/safety behavior becomes less consistent. You inspect a batch of PPO updates and see that many sampled responses have large positive advantages, and the ratio between new-policy and old-policy token probabilities often exceeds the PPO clip range before clipping is applied. The measured KL divergence to the reference also rises sharply early in training.
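
To make the inspected quantities concrete, the sketch below shows the kind of per-batch statistics one might log to confirm this pattern. The helper name `ppo_batch_diagnostics` and the particular statistics are assumptions for illustration, not the team's actual monitoring code.

```python
import torch

def ppo_batch_diagnostics(logprobs_new, logprobs_old, logprobs_ref,
                          advantages, clip_range=0.2):
    """Summary statistics of the kind described above: how often the
    probability ratio leaves the clip interval, how large the advantages
    are, and how far the policy has drifted from the frozen reference."""
    ratio = torch.exp(logprobs_new - logprobs_old)
    outside_clip = (ratio < 1 - clip_range) | (ratio > 1 + clip_range)
    return {
        "mean_ratio": ratio.mean().item(),
        "clip_fraction": outside_clip.float().mean().item(),   # fraction of tokens outside the clip range
        "mean_advantage": advantages.mean().item(),
        "frac_positive_adv": (advantages > 0).float().mean().item(),
        "kl_to_reference": (logprobs_new - logprobs_ref).mean().item(),  # crude per-token KL estimate
    }
```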

As the on-call ML lead, propose ONE concrete change to the PPO optimization setup (not “collect more data”) that is most likely to address the regressions while preserving most of the reward gain. In your answer, explain the causal chain using: (a) how the reward model’s ranking-based training affects what the reward signal represents, (b) how the advantage-weighted policy gradient in PPO pushes probability mass, and (c) how the KL-divergence penalty interacts with PPO’s clipping to constrain (or fail to constrain) policy drift from the reference.
