Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
You are leading an RLHF fine-tuning effort for a customer-support LLM. Humans provide pairwise rankings of candidate responses per prompt, and you train a reward model to score responses so that preferred responses get higher scores (i.e., reward model training is a ranking problem). You then optimize the policy with PPO using a policy-gradient-style objective weighted by an advantage estimate, while also applying a KL-divergence penalty to keep the policy close to a frozen reference model.
After several PPO iterations, offline evaluation shows a puzzling pattern on a held-out set of prompts: (1) the reward model assigns higher scores to the new policy’s sampled responses than to the reference model’s responses, but (2) human spot-checkers say the new policy is noticeably more verbose and sometimes less directly helpful than the reference, and (3) the average KL divergence from the reference is increasing even though you have a nonzero KL penalty.
Write an analysis that proposes a coherent, end-to-end explanation for how all three observations can be simultaneously true. In your answer, explicitly connect: (a) how a ranking-trained reward model can be systematically biased or exploited, (b) how PPO’s clipped surrogate objective and the policy-gradient objective with advantage can still push probability mass toward these behaviors, and (c) how the KL penalty term interacts with the PPO update (including what it is actually penalizing in terms of log-probabilities) and why it might fail to prevent drift in this situation. Conclude by recommending two concrete changes (e.g., to data collection, reward model training, or PPO/KL settings) and justify the tradeoffs each change introduces.
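For concreteness, here is a minimal sketch of the two training signals this scenario describes, assuming the common Bradley-Terry-style pairwise loss for the reward model and the usual per-token log-probability-difference estimate of the KL penalty; names such as ranking_loss, kl_penalized_reward, and beta are illustrative, not part of the scenario:

```python
import torch
import torch.nn.functional as F

# Pairwise ranking loss for the reward model: push score(preferred) above score(rejected).
def ranking_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(margin): near zero when the preferred score is clearly higher,
    # large when the margin is zero or negative.
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# KL-penalized reward used in the PPO stage: the reward model's score minus a
# per-token penalty on how far the policy's log-probs drift from the frozen reference.
def kl_penalized_reward(rm_score: torch.Tensor,
                        policy_logprobs: torch.Tensor,  # log pi(y_t | x, y_<t), shape [T]
                        ref_logprobs: torch.Tensor,     # log pi_ref(y_t | x, y_<t), shape [T]
                        beta: float = 0.1) -> torch.Tensor:
    per_token_kl = policy_logprobs - ref_logprobs       # single-sample estimate of KL(pi || pi_ref)
    return rm_score - beta * per_token_kl.sum()
```

Written this way, observation (3) is less surprising: the penalty discourages drift only in proportion to beta, so whenever the reward model's score gains outweigh beta times the accumulated log-probability gap, the average KL from the reference can keep rising.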

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Historical Development of RLHF
Policy Learning in RLHF
Justification for Using RLHF over Supervised Learning
Bridging Language Modeling and Reinforcement Learning Notations in RLHF
Architectural Components of an RLHF System
Three-Stage Training Process of RLHF
Refinements and Alternatives to RLHF
Rationale for End-of-Sequence Rewards in RLHF
High-Level Process of RLHF with PPO
Limitations of Human Feedback in LLM Alignment
Computational and Stability Challenges of RLHF
Goal of RLHF
Origin and Application of RLHF
Dual Learning Tasks of RLHF: Reward and Policy Learning
Four-Stage Process of Reinforcement Learning from Human Feedback (RLHF)
RLHF Training Process with PPO
An AI development team is considering two different methods for training a conversational assistant to be more helpful and aligned with user expectations. Method 1 involves having human experts write a large dataset of ideal, high-quality responses to various prompts, and then training the AI to imitate these examples. Method 2 involves having the AI generate several responses to each prompt, and then asking human experts to simply rank these responses from best to worst. This ranking data is then used to train a separate 'preference model' that provides a reward signal to guide the AI's learning process. Which statement best analyzes the primary advantage of Method 2 over Method 1?
LLM as the Agent in RLHF
Reward Model as an Environment Proxy in RLHF
A team is using human feedback to improve a language model's ability to follow instructions safely and helpfully. Arrange the following high-level stages of this process into the correct chronological order.
RLHF Objective Function
Comparison of Objectives: Supervised Fine-Tuning vs. RLHF
Evaluating a Training Method for a High-Stakes Application
Diagnosing Instability in an RLHF + PPO Training Run
Choosing and Justifying an RLHF Objective Under Competing Product Constraints
Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO
Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses
Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions
Designing an RLHF Training Blueprint for a Regulated Customer-Support LLM
Intuition of the Ranking Loss Function in RLHF
Reward Model Training via Ranking Loss Minimization
Reward Model Loss as Negative Log-Likelihood
Flexibility of Ranking Loss Functions in Reward Model Training
Learning-to-Rank Approaches for Human Preference Modeling
An AI team is training a system to learn from human preferences. They have a dataset where, for a given input x, humans consistently prefer response y_preferred over response y_rejected. After training, they test two different scoring models, Model A and Model B, on this pair. The models produce the following scores:
- Model A: score(x, y_preferred) = 3.2, score(x, y_rejected) = 1.5
- Model B: score(x, y_preferred) = -0.5, score(x, y_rejected) = -2.0
Based on these scores, which statement accurately evaluates the models' performance on this specific example?
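As a quick check of these numbers under the standard pairwise ranking loss -log σ(score_preferred - score_rejected) (the loss form is assumed here; the scores are the ones given above):

```python
import math

def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    # -log sigmoid(margin); lower is better, equal to log(2) ~= 0.693 at a zero margin.
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_loss(3.2, 1.5))    # Model A: margin 1.7 -> loss ~= 0.17
print(pairwise_loss(-0.5, -2.0))  # Model B: margin 1.5 -> loss ~= 0.20
```

Both models give the preferred response the higher score, so both rank this pair correctly; only the margin between the two scores matters, not whether the raw scores happen to be positive or negative.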
A reward model is being trained to learn human preferences by minimizing a ranking loss function. This function penalizes the model when the score it assigns to a human-preferred response is not higher than the score for a less-preferred response. Given the same prompt, which of the following scoring outcomes for a preferred/less-preferred pair would incur a penalty from the loss function?
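Assuming the same -log σ(margin) form as above (an interpretation, not stated in the item), the penalty condition can be read directly off the formula:

loss = -log σ( score(x, y_preferred) - score(x, y_rejected) ), which equals log 2 ≈ 0.69 at a tie and grows as the margin turns negative.

So any outcome in which the preferred response does not receive a strictly higher score than the less-preferred one is the kind that incurs the penalty.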
Evaluating Reward Model Score Outputs
Overall PPO Objective Function for Language Models
During the policy optimization phase of training a large language model, the model is being rewarded for providing detailed explanations. The 'reference policy' is a version of the model that typically gives concise, direct answers. The current policy generates two possible responses to a user's query:
Response A: 'Yes.'
Response B: 'Affirmative, the data you have presented aligns with the expected parameters, and therefore, the conclusion you have reached is indeed correct and validated.'
Assuming the reference policy would have a very high probability of generating Response A and a near-zero probability of generating Response B, which response would incur a larger penalty term designed to prevent deviation from the reference policy, and why?
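A toy numeric sketch of the penalty in question, using hypothetical sequence probabilities (0.30, 0.60, 0.90, and 1e-6 are invented for illustration; the item gives no actual numbers):

```python
import math

# Hypothetical per-sequence probabilities: the frozen reference strongly prefers the
# concise Response A and almost never produces the verbose Response B.
p_policy = {"A": 0.30, "B": 0.60}   # current policy after reward optimization
p_ref    = {"A": 0.90, "B": 1e-6}   # frozen reference policy

for name in ("A", "B"):
    # Per-sample contribution to KL(pi || pi_ref): log pi(y|x) - log pi_ref(y|x).
    penalty = math.log(p_policy[name]) - math.log(p_ref[name])
    print(name, round(penalty, 2))
# A: log(0.30) - log(0.90) ≈ -1.10  (negative, so no penalty for this response)
# B: log(0.60) - log(1e-6) ≈ +13.3  (large penalty: the near-zero reference probability dominates)
```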
Consequences of Policy Regularization Strength
Analysis of the Policy Regularization Penalty
Use of Proximal Policy Optimization (PPO) in RLHF
PPO Objective for LLM Training
PPO as an Online Reinforcement Learning Method
Overall PPO Objective Function for Language Models
An engineer is training a text-generation model using a reinforcement learning algorithm. They notice that the model's performance is highly unstable: after a few successful updates, a single large update often causes the model's output quality to degrade significantly. Which of the following mechanisms is specifically designed to prevent such large, destabilizing policy updates by limiting the magnitude of the change between the new and old policies at each step?
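A minimal sketch of the mechanism the question is pointing at, assuming the standard PPO clipped surrogate; clip_eps = 0.2 is a conventional default, not a value given in the item:

```python
import torch

def ppo_clip_term(logprob_new: torch.Tensor,
                  logprob_old: torch.Tensor,
                  advantage: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the updated policy and the policy that sampled the data.
    ratio = torch.exp(logprob_new - logprob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the minimum removes any incentive to push the ratio outside [1-eps, 1+eps],
    # which is what caps the size of a single policy update.
    return torch.min(unclipped, clipped).mean()
```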
Analysis of PPO's Stabilization Components
An engineer is fine-tuning a large language model using a reinforcement learning algorithm. The training objective is designed to maximize a reward score while also penalizing large deviations from the model's initial, trusted behavior. A specific hyperparameter, β, controls the strength of this penalty. The engineer sets β to a very high value. What is the most likely outcome of the training process?
Composite Objective for PPO-Clip
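Both of the items above come down to how the coefficient β scales the KL penalty against the reward term. A toy calculation with invented numbers (the 0.8 reward gain and 0.5 KL cost are purely illustrative):

```python
# Net value of a candidate policy update under an objective of the form reward_gain - beta * KL.
reward_gain = 0.8          # hypothetical improvement in reward-model score
kl_from_reference = 0.5    # hypothetical KL cost of the same update

for beta in (0.01, 0.1, 10.0):
    net = reward_gain - beta * kl_from_reference
    print(beta, round(net, 3))
# beta = 0.01 -> +0.795  (drift is barely penalized; the update looks attractive)
# beta = 0.1  -> +0.75
# beta = 10.0 -> -4.2    (the penalty dominates; the policy is held close to its initial behavior,
#                         so the reward score improves little)
```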
A2C Loss Function Formulation
An agent is being trained using a policy gradient method. The objective is to maximize E[ log π(a|s) · A(s,a) ], where π is the policy and A(s,a) is the advantage function, which indicates how much better an action is than the average.
At a specific state s, the agent can choose from three actions, a1, a2, and a3, with differing calculated advantage values.
Assuming the agent performs one optimization step to maximize the objective, how will the policy probabilities for these actions most likely change?
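A small sketch of one ascent step on this objective, with hypothetical advantage values (the +2, 0, and -1 below are invented, since the item's values are not shown):

```python
import torch
import torch.nn.functional as F

logits = torch.zeros(3, requires_grad=True)   # uniform initial policy over a1, a2, a3
advantages = torch.tensor([2.0, 0.0, -1.0])   # hypothetical: a1 above average, a2 average, a3 below

log_probs = F.log_softmax(logits, dim=-1)
objective = (log_probs * advantages).sum()    # sum over the actions of log pi(a|s) * A(s,a)
objective.backward()

with torch.no_grad():
    new_logits = logits + 0.1 * logits.grad   # one gradient-ascent step
print(F.softmax(new_logits, dim=-1))
# The probability of a1 (positive advantage) rises and a3 (negative advantage) falls;
# a2 (zero advantage) gets no direct push and shifts only because probabilities must sum to one.
```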
Impact of a Zero Advantage Value
Policy Gradient with Advantage Function Formula
Rationale for Using the Advantage Function in Policy Gradients