Learn Before
Reference Policy and Model Probability
In a system that learns from human feedback, a 'reference model' with a fixed set of parameters, θ_ref, is used to generate a probability distribution, P_θref(· | x). Explain the precise relationship between this probability distribution and the system's 'reference policy', denoted as π_ref.
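The relationship the question targets can be made concrete with a small sketch. The snippet below uses a hypothetical toy vocabulary and hand-picked logits (both assumptions for illustration, not from the source): the frozen reference model's softmax output is the distribution P_θref(· | x), and the reference policy π_ref is simply defined to be that same distribution, not a separately trained approximation of it.

```python
import math

# Hypothetical toy vocabulary and fixed "reference model" logits for one
# prompt x. In RLHF the reference model's parameters are frozen, so these
# logits (and hence the distribution below) never change during training.
vocab = ["consequently", "therefore", "however"]
logits = {"consequently": -1.0, "therefore": 0.5, "however": 0.2}

def reference_model_probs(logits):
    """Softmax over the logits: the distribution P_theta_ref(. | x)."""
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

# The reference policy pi_ref IS this distribution by definition --
# there is no second model trained to approximate it.
pi_ref = reference_model_probs(logits)
```

Because π_ref is an identity with the model's output distribution, querying the policy for any token (e.g. `pi_ref["consequently"]`) returns exactly the probability the reference model assigns to that token.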
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Comprehension in Revised Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a reinforcement learning process that uses human feedback, a 'reference model' with a fixed set of parameters, θ_ref, is used as a baseline. For a specific input prompt x, this model calculates that the probability of generating the word 'consequently' as the next word is 0.04. Given that the reference policy, π_ref, is formally defined as the probability distribution generated by this reference model, what is the value of π_ref('consequently' | x)?
True or False: In a Reinforcement Learning from Human Feedback (RLHF) system, the reference policy is a function that is trained to approximate the probability distribution generated by the reference model.