Reward Function in Terms of Policy Models and Normalization Factor
By rearranging the equation for the optimal target policy, the underlying reward function can be expressed solely in terms of the target model π_θ, the reference model π_ref, and the normalization factor Z(x). This is a profound shift: although the initial goal was to learn a policy from a given reward model, the rearrangement yields a representation of the reward derived entirely from the policies. The resulting formula is:

r(x, y) = β log(π_θ(y|x) / π_ref(y|x)) + β log Z(x)
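The formula above can be evaluated directly once the two policy probabilities are known. A minimal numerical sketch (all values here, including β, Z(x), and the toy probabilities, are illustrative assumptions, not values from the text):

```python
import math

def reward(pi_theta, pi_ref, beta, Z):
    """Reward recovered from the rearranged optimal-policy equation:
    r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)) + beta * log Z(x)."""
    return beta * math.log(pi_theta / pi_ref) + beta * math.log(Z)

# Toy values for a single prompt-response pair (x, y).
beta = 0.1        # positive scaling (temperature) parameter
Z = 1.7           # normalization factor Z(x)
pi_theta = 0.30   # target policy probability pi_theta(y|x)
pi_ref = 0.20     # reference policy probability pi_ref(y|x)

r = reward(pi_theta, pi_ref, beta, Z)
```

Note that the reward grows with the log-ratio of the target policy to the reference policy: responses the target policy upweights relative to the reference receive higher reward.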
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reward Function in Terms of Policy Models and Normalization Factor
In a particular policy optimization framework, the target policy, denoted as π_θ, is determined by the following relationship involving a reference policy π_ref, a reward function r(x, y), a positive temperature parameter β, and a normalization term Z(x):

π_θ(y|x) = (1 / Z(x)) π_ref(y|x) exp(r(x, y) / β)

Given this formula, what is the primary effect of significantly increasing the reward for a single, specific output y, while keeping all other factors constant?
Consider the following equation that defines a target policy π_θ based on a reference policy π_ref, a reward function r(x, y), a positive scaling parameter β, and a normalization term Z(x):

π_θ(y|x) = (1 / Z(x)) π_ref(y|x) exp(r(x, y) / β)

True or False: If the reward function r(x, y) is equal to zero for all possible outputs y given an input x, the target policy π_θ will be identical to the reference policy π_ref.
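The zero-reward scenario in the question above can be checked numerically. A minimal sketch (the reference distribution below is an arbitrary toy example): with r(x, y) = 0 for every y, each exponential factor is 1, so Z(x) sums the reference probabilities to 1 and the target policy collapses onto the reference policy.

```python
import math

def target_policy(pi_ref, rewards, beta):
    """pi_theta(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x),
    where Z(x) normalizes over all outputs y."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

pi_ref = [0.5, 0.3, 0.2]          # toy reference distribution over 3 outputs
zero_r = [0.0, 0.0, 0.0]          # reward is zero for every output
pi_theta = target_policy(pi_ref, zero_r, beta=0.1)
# With r(x, y) = 0 everywhere, exp(r/beta) = 1, Z(x) = 1,
# and pi_theta matches pi_ref exactly.
```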
Impact of the Scaling Parameter on Policy Behavior
Learn After
In a policy-based language model alignment process, the reward r(x, y) for a response y to a prompt x is defined by the equation:

r(x, y) = β log(π_θ(y|x) / π_ref(y|x)) + β log Z(x)

where π_θ is the target policy, π_ref is the reference policy, β is a positive scaling factor, and Z(x) is a normalization factor. If, for a specific response y_1, the target policy assigns a lower probability than the reference policy (i.e., π_θ(y_1|x) < π_ref(y_1|x)), what is the direct consequence for the log-ratio component of the reward calculation?

In a framework for aligning language models, a reward function is defined as:

r(x, y) = β log(π_θ(y|x) / π_ref(y|x)) + β log Z(x)

where π_θ is the target policy, π_ref is a reference policy, β is a scaling factor, and Z(x) is a normalization factor dependent on the prompt x. Given two distinct responses, y_1 and y_2, to the same prompt x, which expression correctly represents the difference in their rewards, r(x, y_1) − r(x, y_2)?
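The reward-difference question above hinges on the fact that both responses share the same prompt x, so the β log Z(x) terms cancel. A small numeric sketch (β, Z, and the probability values are illustrative assumptions):

```python
import math

def reward(pi_theta, pi_ref, beta, Z):
    # r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)) + beta * log Z(x)
    return beta * math.log(pi_theta / pi_ref) + beta * math.log(Z)

beta = 0.1
# Two responses y1 and y2 to the same prompt x share the same Z(x).
r1 = reward(0.40, 0.25, beta, Z=3.2)
r2 = reward(0.10, 0.30, beta, Z=3.2)
diff = r1 - r2

# The beta * log Z(x) terms cancel, leaving only the log-ratio difference:
# r(x, y1) - r(x, y2) = beta * [log(pi_theta(y1|x)/pi_ref(y1|x))
#                               - log(pi_theta(y2|x)/pi_ref(y2|x))]
expected = beta * (math.log(0.40 / 0.25) - math.log(0.10 / 0.30))
```

This cancellation is what lets preference-based objectives compare responses without ever computing the intractable normalization factor Z(x).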
Derivation of DPO Preference Probability from Policy Ratios
Analysis of Reward Function under Policy Convergence