Learn Before
During the alignment of a language model using a preference-based optimization method, a crucial assumption is made that both the underlying reward function and a reference version of the model are held constant. What is the most direct and significant consequence of this assumption for the optimization process?
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Comparison of DPO's Fixed Model Assumption with PPO
Analyzing the Fixed Model Assumption in Policy Optimization
The fixed model assumption in a preference-based optimization framework (as in DPO) implies that only the target policy's parameters are updated during training. Because the reward function and the reference policy are held constant, their contributions enter the objective as fixed quantities rather than trainable ones, reducing alignment to a single-model optimization problem over the target policy.
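The consequence above can be illustrated with a minimal sketch of a DPO-style per-pair loss. This is an illustrative example, not the platform's own material: the function name and argument names are hypothetical, and the reference log-probabilities are passed in as plain constants to mirror the fixed model assumption, since the frozen reference policy never receives gradient updates.

```python
import math

def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for one preference pair (chosen vs. rejected).

    The ref_* values are constants: under the fixed model assumption
    they are computed once from the frozen reference policy. Only the
    target policy's log-probabilities change during optimization.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): the loss falls as the target policy
    # favors the chosen response more strongly than the frozen
    # reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

For example, when the target policy still matches the reference exactly, the margin is zero and the loss is log 2 ≈ 0.693; shifting probability mass toward the chosen response lowers it, with the reference terms never moving.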