When comparing Direct Preference Optimization (DPO) with Proximal Policy Optimization (PPO), what is the primary consequence of DPO's foundational assumption that the reward and reference models are fixed throughout training?
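Hint: the assumption is visible in the standard DPO objective, where the reference policy enters only as a frozen constant. The formula below is the commonly cited form of the DPO loss, not quoted from the source chapter; the notation (trainable policy $\pi_\theta$, frozen reference policy $\pi_{\text{ref}}$, temperature $\beta$, preferred/dispreferred responses $y_w, y_l$ drawn from a static preference dataset $\mathcal{D}$) follows common usage and is assumed here:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Because $\pi_{\text{ref}}$ and $\mathcal{D}$ never change during training, the objective reduces to offline supervised learning over fixed preference pairs, whereas PPO samples fresh responses from the current policy and queries a reward model at every update.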
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing Trade-offs in Policy Optimization for Language Models
Analyzing the Simplification in Direct Preference Optimization