Policy Divergence Penalty for Language Models
The penalty term in PPO for language models quantifies the divergence between the current policy $\pi_\theta$ and a reference policy $\pi_{\text{ref}}$. It is defined as the difference in the log-probabilities of generating the response $y$ given the prompt $x$:

$$\text{penalty}(x, y) = \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)$$

For autoregressive language models, this can be decomposed exactly into a sum over the $T$ tokens in the sequence:

$$\text{penalty}(x, y) = \sum_{t=1}^{T} \left[ \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t}) \right]$$
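A minimal sketch of this token-level computation in PyTorch, assuming the per-token log-probabilities of the sampled response tokens have already been gathered from both models (the function name, tensor shapes, and example numbers are illustrative, not taken from this page):

```python
import torch

def policy_divergence_penalty(logp_current: torch.Tensor,
                              logp_ref: torch.Tensor) -> torch.Tensor:
    """Sequence-level penalty from per-token log-probabilities.

    Both arguments hold the log-probability each policy assigned to the
    tokens actually sampled in the response, shape (seq_len,).
    """
    # Summing the per-token log-ratios recovers the sequence-level
    # quantity log pi_theta(y|x) - log pi_ref(y|x) exactly.
    return (logp_current - logp_ref).sum()

# Illustrative numbers: a three-token response.
logp_current = torch.tensor([-1.2, -0.7, -2.3])
logp_ref = torch.tensor([-1.0, -1.1, -2.0])
print(policy_divergence_penalty(logp_current, logp_ref))  # tensor(-0.1000)
```

Note that the result can be negative: the penalty is a signed log-ratio for one sampled sequence, not a true (non-negative) KL divergence, which it only estimates in expectation.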

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Approximation of the Policy Divergence Penalty
Policy Divergence Penalty for Language Models
In a policy optimization process, a penalty is used to measure the change between a current policy, $\pi_\theta$, and a reference policy, $\pi_{\text{ref}}$. The penalty is calculated for a specific sequence of actions and states (a trajectory, $\tau$) using the formula:

$$\text{penalty}(\tau) = \log \pi_\theta(\tau) - \log \pi_{\text{ref}}(\tau)$$
If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation?
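A one-step rearrangement (a sketch using the symbols reconstructed above) makes the sign of the penalty concrete:

$$\log \pi_\theta(\tau) - \log \pi_{\text{ref}}(\tau) > 0 \iff \pi_\theta(\tau) > \pi_{\text{ref}}(\tau),$$

so a large positive value indicates that the current policy assigns substantially higher probability to the trajectory than the reference policy does.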
Calculating Policy Divergence Penalty
Interpreting Policy Divergence
RLHF Policy Optimization Objective
KL-Divergence Penalty in RLHF Policy Optimization
An AI development team is fine-tuning a language model using a reinforcement learning process guided by a reward model. They observe that the model's outputs, while receiving high scores from the reward model, are becoming stylistically unnatural and deviating significantly from the helpful tone established during its initial supervised training. Which of the following adjustments to the training process is most specifically designed to counteract this behavioral drift?
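For context, the mechanism aimed at exactly this failure mode is a KL-style divergence penalty folded into the reward signal; a common formulation (the coefficient $\beta$ and the reward-model score $r_\phi$ are conventional notation, not taken from this page) is:

$$r_{\text{total}}(x, y) = r_\phi(x, y) - \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \right],$$

which docks reward from generations that drift far from the reference (supervised) policy.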
Diagnosing and Mitigating Reward Hacking
Consequences of Omitting a Reference Policy in RLHF
Learn After
PPO Objective Formula for LLM Training in RLHF
An autoregressive language model is generating the two-token response 'Good day' given a prompt. The table below shows the per-token log-probabilities from the current policy being trained ($\pi_\theta$) and a fixed reference policy ($\pi_{\text{ref}}$). The policy divergence penalty is calculated as the sum of the differences between the log-probabilities of the current and reference policies for each token.

| Token | $\log \pi_\theta$ | $\log \pi_{\text{ref}}$ |
| :--- | :---: | :---: |
| 'Good' | -0.8 | -1.5 |
| 'day' | -0.4 | -2.1 |
Based on this data, what can be concluded about the current policy's behavior for this specific generation?
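Working the table values through the token-level decomposition (a worked check, not part of the original question):

$$(-0.8 - (-1.5)) + (-0.4 - (-2.1)) = 0.7 + 1.7 = 2.4,$$

a large positive penalty, meaning the current policy assigns considerably higher probability to both tokens than the reference policy does.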
Diagnosing Training Issues with Policy Divergence
Overall PPO Objective Function for Language Models
Interpreting the Policy Divergence Penalty