Learn Before
Interpreting the Policy Divergence Penalty
Explain the primary purpose of using a policy divergence penalty when fine-tuning a language model. In your explanation, describe how this penalty is calculated at the token level for an autoregressive model and what a consistently high positive penalty value suggests about the relationship between the current and reference policies.
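For concreteness, here is a minimal Python sketch of the token-level calculation described above, assuming the penalty is the sum over response tokens of $\log \pi_\theta - \log \pi_{\text{ref}}$. The function and variable names are illustrative, not from any particular library; the sample numbers match the worked item under Related below.

```python
def divergence_penalty(current_logprobs, ref_logprobs):
    """Sum of per-token log-probability gaps between the current policy
    (pi_theta) and the fixed reference policy (pi_ref).

    A consistently high positive total means the current policy assigns
    much higher probability to its own generations than the reference
    policy does, i.e. it has drifted far from the reference.
    """
    return sum(cur - ref for cur, ref in zip(current_logprobs, ref_logprobs))

# Hypothetical two-token response: per-token log-probs under each policy.
penalty = divergence_penalty([-0.8, -0.4], [-1.5, -2.1])
print(penalty)  # 0.7 + 1.7 -> about 2.4, a large positive divergence
```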
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
PPO Objective Formula for LLM Training in RLHF
An autoregressive language model is generating the two-token response 'Good day' given a prompt. The table below shows the per-token log-probabilities from the current policy being trained ($\pi_\theta$) and a fixed reference policy ($\pi_{\text{ref}}$). The policy divergence penalty is calculated as the sum, over tokens, of the difference between the current and reference log-probabilities.
| Token | $\log \pi_\theta$ | $\log \pi_{\text{ref}}$ |
| :--- | :---: | :---: |
| 'Good' | -0.8 | -1.5 |
| 'day' | -0.4 | -2.1 |
Based on this data, what can be concluded about the current policy's behavior for this specific generation?
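Applying the stated sum-of-differences formula to the table (a worked check, assuming the two value columns are $\log \pi_\theta$ and $\log \pi_{\text{ref}}$ as reconstructed above):

$$\big(-0.8 - (-1.5)\big) + \big(-0.4 - (-2.1)\big) = 0.7 + 1.7 = 2.4$$

The large positive total indicates that the current policy assigns substantially higher probability to this response than the reference policy does, i.e. it has diverged noticeably for this generation.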
Diagnosing Training Issues with Policy Divergence
Overall PPO Objective Function for Language Models
Interpreting the Policy Divergence Penalty