KL-Divergence Penalty in RLHF Policy Optimization
A penalty term is incorporated into the RLHF objective function to regularize the policy and prevent it from deviating excessively from a reference policy. This penalty is formulated as the difference between the log probabilities of a sequence under the current policy $\pi_\theta$ and the reference policy $\pi_{\text{ref}}$, summed over all tokens in the sequence. The formula is:

$$\text{penalty}(x, y) = \sum_{t=1}^{T} \left[ \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t}) \right]$$

where $x$ is the prompt, $y$ is the generated sequence of $T$ tokens, and $y_{<t}$ denotes the tokens generated before position $t$.
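As a minimal sketch of how this sum is computed in practice (assuming the per-token log probabilities of the sampled sequence have already been gathered from both models; the helper name `kl_penalty` is illustrative, not from any particular library):

```python
import torch

def kl_penalty(logprobs_policy: torch.Tensor,
               logprobs_ref: torch.Tensor) -> torch.Tensor:
    """Per-sequence penalty: sum over tokens of
    log pi_theta(y_t | x, y_<t) - log pi_ref(y_t | x, y_<t).

    Both tensors have shape (batch, seq_len) and hold the log
    probability of each sampled token under each policy.
    """
    return (logprobs_policy - logprobs_ref).sum(dim=-1)

# Toy batch of one 3-token sequence.
lp_policy = torch.tensor([[-0.1, -0.5, -0.2]])
lp_ref    = torch.tensor([[-0.3, -0.4, -1.0]])
print(kl_penalty(lp_policy, lp_ref))  # tensor([0.9000])
```

In a full PPO loop this quantity is typically scaled by a coefficient and subtracted from the reward, so tokens the reference policy considers unlikely reduce the effective reward.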

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
KL-Divergence Penalty in RLHF Policy Optimization
A team is fine-tuning a language model with a single training objective: adjust the model's parameters to maximize the average score from a fixed reward model. After many training iterations, the team observes that while the policy consistently achieves high reward scores, the generated text is becoming repetitive and stylistically unnatural. What is the most likely reason for this outcome, based on the optimization objective?
Diagnosing Undesirable Model Behavior
Match each mathematical component from the policy learning objective function with its conceptual role in the training process.
RLHF Policy Optimization Objective
Policy Divergence Penalty for Language Models
KL-Divergence Penalty in RLHF Policy Optimization
An AI development team is fine-tuning a language model using a reinforcement learning process guided by a reward model. They observe that the model's outputs, while receiving high scores from the reward model, are becoming stylistically unnatural and deviating significantly from the helpful tone established during its initial supervised training. Which of the following adjustments to the training process is most specifically designed to counteract this behavioral drift?
Diagnosing and Mitigating Reward Hacking
Consequences of Omitting a Reference Policy in RLHF
Learn After
Overall PPO Objective Function for Language Models
During the policy optimization phase of training a large language model, the model is being rewarded for providing detailed explanations. The 'reference policy' is a version of the model that typically gives concise, direct answers. The current policy generates two possible responses to a user's query:
Response A: 'Yes.'
Response B: 'Affirmative, the data you have presented aligns with the expected parameters, and therefore, the conclusion you have reached is indeed correct and validated.'
Assuming the reference policy would have a very high probability of generating Response A and a near-zero probability of generating Response B, which response would incur a larger penalty term designed to prevent deviation from the reference policy, and why?
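For intuition only (the probabilities below are invented for illustration and are not part of the scenario), evaluating the penalty term from the definition above at the sequence level shows why a response the reference policy considers near-impossible is penalized far more heavily:

```python
import math

# Hypothetical sequence probabilities, chosen to match the scenario's
# qualitative description: the reference policy strongly favors the
# concise Response A and almost never produces Response B.
p_policy_A, p_ref_A = 0.20, 0.90
p_policy_B, p_ref_B = 0.70, 1e-6

penalty_A = math.log(p_policy_A) - math.log(p_ref_A)  # ~ -1.50
penalty_B = math.log(p_policy_B) - math.log(p_ref_B)  # ~ +13.46

# Response B incurs the far larger penalty, because log pi_ref is
# hugely negative for a sequence the reference almost never generates.
print(f"{penalty_A:.2f} {penalty_B:.2f}")
```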
Consequences of Policy Regularization Strength
Analysis of the Policy Regularization Penalty
Your team is running RLHF for a customer-facing LL...
You’re running an RLHF fine-tuning job for an inte...
You are reviewing an RLHF training run for an inte...
Diagnosing Instability in an RLHF + PPO Training Run
Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
Choosing and Justifying an RLHF Objective Under Competing Product Constraints
Designing an RLHF Training Blueprint for a Regulated Customer-Support LLM
Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses
Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions
Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO