1Cademy - Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions

Case Study

You are on an applied LLM team fine-tuning a customer-support assistant using RLHF with PPO. Human labelers provide pairwise preferences between two candidate responses per prompt, and you train a reward model from these rankings. In policy optimization, you maximize a PPO-style objective that uses an advantage estimate and includes a KL-divergence penalty to keep the updated policy close to a frozen reference model.
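To make the two ingredients of this setup concrete, here is a minimal sketch, not the team's actual implementation. It assumes a Bradley-Terry style pairwise loss for the reward model and a sequence-level KL estimate computed as the summed log-probability ratio between the policy and the reference model. The function names, the coefficient name `beta`, and the exact form of the penalty are illustrative assumptions; the case study only states that a ranking-based reward model and a KL penalty are used.

```python
import math
from typing import Sequence


def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Assumed Bradley-Terry loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_shaped_reward(
    rm_score: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float,
) -> float:
    """Reward-model score minus beta times a sampled KL estimate.

    The KL estimate here is the sum over tokens of
    log pi(token) - log pi_ref(token) for one sampled response (an assumption).
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate
```

Under this form, the loss only cares about the score margin between the preferred and rejected response, and the penalty scales linearly with `beta`, so the balance between the two terms depends on how `beta` compares with typical reward-model score differences.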

After several training iterations, offline evaluation shows the reward model score is steadily increasing, but a targeted audit finds the assistant is drifting into a “corporate-sounding” style that is overly verbose and sometimes avoids directly answering. The drift is most pronounced on prompts where the reference model would answer briefly. You inspect a batch of PPO training data and see many sampled responses where:

the reward model assigns a slightly higher score to the verbose response than to a concise, correct response;
the KL penalty for the verbose response is large (because the reference model assigns it very low probability); and
the computed advantage values for tokens in the verbose response are still positive overall.
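The three observations above can coexist numerically. The sketch below uses made-up numbers (all scores, KL values, and the baseline are hypothetical, and the advantage is a simple reward-minus-baseline estimate rather than GAE) to show how the sign of the advantage for a high-KL response depends on the penalty coefficient. It is one way to explore the scenario, not a claim about what happened in this training run.

```python
def shaped_reward(rm_score: float, kl: float, beta: float) -> float:
    """Effective reward: reward-model score minus the scaled KL penalty."""
    return rm_score - beta * kl


def advantage(rm_score: float, kl: float, beta: float, baseline: float) -> float:
    """Simplified advantage: effective reward minus a fixed baseline (assumed)."""
    return shaped_reward(rm_score, kl, beta) - baseline


# Hypothetical values: the verbose response scores slightly higher under the
# reward model but sits far from the reference policy (large KL).
verbose = {"rm_score": 2.1, "kl": 12.0}
concise = {"rm_score": 2.0, "kl": 1.0}
baseline = 1.0  # hypothetical value-function estimate for the prompt

for beta in (0.01, 0.05, 0.2):
    adv_verbose = advantage(verbose["rm_score"], verbose["kl"], beta, baseline)
    adv_concise = advantage(concise["rm_score"], concise["kl"], beta, baseline)
    print(f"beta={beta}: verbose={adv_verbose:+.2f}, concise={adv_concise:+.2f}")
```

With these illustrative numbers, the verbose response keeps a positive advantage at the smaller coefficients and only turns negative at the largest one, which is the kind of interaction the audit would need to check against the real logged values.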

As the person responsible for stabilizing training, explain (1) the most plausible mechanism that allows PPO to keep increasing the probability of these verbose responses despite the KL penalty, and (2) one concrete change you would make to either the reward-model training setup (as a ranking problem) or the PPO/KL configuration to reduce this drift. Justify your choice in terms of how it would change the advantage-weighted policy gradient update, the effective reward signal, or both.

Updated 2026-02-06
