Multiple Choice

A language model alignment method re-expresses the probability that a preferred response y_a is chosen over a dispreferred response y_b, given a prompt x, as follows:

Pr(y_a ≻ y_b | x) = Sigmoid( β log( π_θ(y_a|x) / π_ref(y_a|x) ) - β log( π_θ(y_b|x) / π_ref(y_b|x) ) )

Here, π_θ is the policy being trained and π_ref is a fixed reference policy. Based on this formulation, what is the primary reason this method can be trained without an explicit, separately trained reward model?

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science