Case Study

Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO

You are the on-call ML engineer for an internal customer-support LLM being aligned with RLHF. Humans provide pairwise preferences between two candidate answers per prompt, and a reward model is trained from these rankings. The policy is then optimized with PPO using an advantage-based policy-gradient objective, while also applying a KL-divergence penalty to keep the policy close to a frozen reference model (the pre-RLHF instruction-tuned model).

During a new training run, the following pattern appears over ~2,000 PPO updates:

  • The reward model’s average score on sampled policy outputs increases sharply.
  • The measured KL divergence between the current policy and the reference policy also increases sharply.
  • Offline human spot-checks show the model is getting worse: it produces verbose, overly confident answers that often ignore the user’s constraints, yet the reward model scores them highly.

Assume the PPO implementation is standard (clipped surrogate objective + KL penalty) and the reward model was trained only on pairwise rankings.

As the responsible engineer, what is the most plausible mechanism that explains how these three observations can co-occur, and what single change would you make first to the PPO objective/training setup to address it? Your answer must explicitly connect (1) reward-model-as-ranking training, (2) advantage-weighted policy-gradient updates in PPO, and (3) the role of the KL penalty/reference policy in constraining updates.

0

1

Updated 2026-02-06

Contributors are:

Who are from:

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.4 Alignment - Foundations of Large Language Models

Related