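Consider the following equation, which defines a target policy $\pi^*(y|x)$ in terms of a reference policy $\pi_{\mathrm{ref}}(y|x)$, a reward function $r(x,y)$, a positive scaling parameter $\beta$, and a normalization term $Z(x)$:

$$\pi^*(y|x) = \frac{\pi_{\mathrm{ref}}(y|x)\,\exp\left(\frac{1}{\beta}\, r(x,y)\right)}{Z(x)}$$

True or False: If the reward function $r(x,y)$ is equal to zero for all possible outputs $y$ given an input $x$, the target policy $\pi^*(y|x)$ will be identical to the reference policy $\pi_{\mathrm{ref}}(y|x)$.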
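The zero-reward case in the question above can be checked numerically. The sketch below assumes the usual closed form, in which the target policy is proportional to the reference policy times exp(r / beta) and the normalization term Z(x) is the sum of those weights over all outputs. The function name `target_policy`, the three-output distribution, and the value of `beta` are hypothetical choices made only for illustration.

```python
import math

def target_policy(pi_ref, rewards, beta):
    """Return the target distribution, proportional to pi_ref * exp(reward / beta)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Unnormalized weight for each candidate output y.
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # normalization term Z(x)
    return [w / z for w in weights]

# Hypothetical reference distribution over three outputs (an assumption).
pi_ref = [0.5, 0.3, 0.2]
zero_reward = [0.0, 0.0, 0.0]

pi_star = target_policy(pi_ref, zero_reward, beta=0.5)
matches_reference = all(abs(a - b) < 1e-12 for a, b in zip(pi_star, pi_ref))
```

With every reward equal to zero, each exponential factor is exp(0) = 1, so the weights are the reference probabilities themselves and Z(x) is their sum, which is 1. Under these assumptions `matches_reference` evaluates to true for any positive `beta`.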
Tags
Ch.4 Alignment - Foundations of Large Language Models
Related
Reward Function in Terms of Policy Models and Normalization Factor
In a particular policy optimization framework, the target policy, denoted as $\pi^*(y|x)$, is determined by the following relationship involving a reference policy $\pi_{\mathrm{ref}}(y|x)$, a reward function $r(x,y)$, a positive temperature parameter $\beta$, and a normalization term $Z(x)$: $\pi^*(y|x) = \pi_{\mathrm{ref}}(y|x)\,\exp\left(\frac{1}{\beta}\, r(x,y)\right) / Z(x)$. Given this formula, what is the primary effect of significantly increasing the reward for a single, specific output $y$, while keeping all other factors constant?
Impact of the Scaling Parameter on Policy Behavior
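Two of the related items above, raising the reward of a single output and the role of the scaling parameter, can be explored with a small numerical sketch. It assumes the target policy is proportional to the reference policy times exp(r / beta), normalized over all outputs. The helper `target_policy`, the three-output distribution, and the specific reward and beta values are hypothetical and chosen only for illustration.

```python
import math

def target_policy(pi_ref, rewards, beta):
    """Return the target distribution, proportional to pi_ref * exp(reward / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # normalization term Z(x)
    return [w / z for w in weights]

# Hypothetical reference distribution over three outputs (an assumption).
pi_ref = [0.5, 0.3, 0.2]

# Raise the reward of the last output only, keeping everything else fixed.
base = target_policy(pi_ref, [0.0, 0.0, 0.0], beta=1.0)
boosted = target_policy(pi_ref, [0.0, 0.0, 5.0], beta=1.0)

# Same rewards, different scaling parameters.
sharp = target_policy(pi_ref, [0.0, 0.0, 1.0], beta=0.1)    # small beta
smooth = target_policy(pi_ref, [0.0, 0.0, 1.0], beta=10.0)  # large beta
```

In this sketch, the boosted output gains probability mass while the other outputs lose mass through the larger normalization term, yet keep their relative ratio from the reference policy. A small `beta` amplifies reward differences, so `sharp` concentrates on the rewarded output, while a large `beta` keeps `smooth` close to the reference policy.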