Learn Before
Analyzing a Modified Policy Optimization Objective
Analyze the most likely outcome of the experiment described in the case study. Explain why this outcome would occur by referencing the roles of the different components in the policy optimization objective.
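For reference, the objective this prompt alludes to is commonly written in the following KL-regularized form. This is a sketch under assumptions: the case study's modified version is not reproduced here, and the symbols (r, \pi_\theta, \pi_{\mathrm{ref}}, \beta, \mathcal{D}) are illustrative notation, not necessarily the course's own.

```latex
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D}} \Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\; \beta \, D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]
```

Here r(x, y) is the reward model's score for output y given prompt x, \pi_{\mathrm{ref}} is the reference policy (typically the supervised fine-tuned model), and \beta \ge 0 sets how strongly divergence from the reference is penalized.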
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
PPO Objective for LLM Training
Derivation of the KL Divergence Objective for Policy Optimization
During the policy optimization stage of training a large language model, an engineer observes that the model's outputs are coherent and safe, but they show very little improvement over the initial supervised fine-tuned version and consistently receive mediocre scores from the reward model. Which of the following is the most likely cause of this issue, based on the policy optimization objective function that balances maximizing rewards with a penalty for policy divergence?
Analyzing the Trade-off in Policy Optimization
Analyzing a Modified Policy Optimization Objective