Learn Before
Diagnosing Training Issues with Policy Divergence
Based on the provided case study, analyze the relationship between the observed model behavior and the consistently low policy divergence penalty. What is the most likely reason for the model's failure to generate novel, helpful responses despite high reward scores?
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
PPO Objective Formula for LLM Training in RLHF
An autoregressive language model is generating the two-token response 'Good day' given a prompt. The table below shows the per-token log-probabilities from the current policy being trained (π_θ) and the fixed reference policy (π_ref). The policy divergence penalty is calculated as the sum, over the tokens, of the differences between the log-probabilities under the current and reference policies.
| Token | log π_θ | log π_ref |
| :--- | :---: | :---: |
| 'Good' | -0.8 | -1.5 |
| 'day' | -0.4 | -2.1 |
Based on this data, what can be concluded about the current policy's behavior for this specific generation?
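The penalty described above can be worked out directly from the table. A minimal sketch (the variable names and dictionaries here are illustrative, not from the source) of the per-token differences and their sum, which serves as a sample-based estimate of the divergence from the reference policy:

```python
# Log-probabilities from the table: current policy (pi_theta) vs. fixed reference (pi_ref).
logp_current = {"Good": -0.8, "day": -0.4}
logp_reference = {"Good": -1.5, "day": -2.1}

# Per-token difference: log pi_theta(token) - log pi_ref(token)
per_token = {tok: logp_current[tok] - logp_reference[tok] for tok in logp_current}

# Policy divergence penalty = sum of per-token differences.
# A positive value means the current policy assigns higher probability
# to this response than the reference policy does.
penalty = sum(per_token.values())

print(per_token)
print(penalty)
```

Here both per-token differences are positive (0.7 and 1.7, summing to 2.4), so the current policy has shifted to make this specific generation substantially more likely than under the reference policy.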
Diagnosing Training Issues with Policy Divergence
Overall PPO Objective Function for Language Models
Interpreting the Policy Divergence Penalty