Learn Before
An alignment algorithm calculates the probability of a preferred response y_a over a dispreferred response y_b for a given prompt x using the following expression:
Sigmoid( β log( π_θ(y_a|x) / π_ref(y_a|x) ) - β log( π_θ(y_b|x) / π_ref(y_b|x) ) )
Based on a direct analysis of this expression, which of the following components is not explicitly required to compute this probability during the training process?
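For concreteness, a minimal numerical sketch of this probability follows. The function name and the summed log-probability inputs are illustrative assumptions, not part of the original card; the code simply evaluates the sigmoid of the difference between the β-scaled policy/reference log-ratios of the two responses.

import math

def dpo_preference_probability(logp_theta_a, logp_ref_a,
                               logp_theta_b, logp_ref_b, beta=0.1):
    # Implicit per-response score: beta * log( pi_theta(y|x) / pi_ref(y|x) ),
    # computed from summed log-probabilities of each full response.
    score_a = beta * (logp_theta_a - logp_ref_a)
    score_b = beta * (logp_theta_b - logp_ref_b)
    # Pr(y_a preferred over y_b | x) = sigmoid(score_a - score_b)
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Example: the trained policy slightly raises the preferred response's
# log-probability relative to the reference and lowers the dispreferred one.
print(dpo_preference_probability(-10.0, -10.5, -12.5, -12.0, beta=0.1))  # ≈ 0.525

Note that only the two policies' log-probabilities on the two responses appear; no quantity outside the expression is needed to evaluate it.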
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model alignment method re-expresses the probability of a preferred response (y_a) over a dispreferred response (y_b) for a given prompt (x) as follows:
Pr(y_a ≻ y_b | x) = Sigmoid( β log( π_θ(y_a|x) / π_ref(y_a|x) ) - β log( π_θ(y_b|x) / π_ref(y_b|x) ) )
where π_θ is the policy being trained and π_ref is a fixed reference policy. Based on this mathematical formulation, what is the primary reason this method can be trained without an explicit, separately-trained reward model? (A short derivation sketch appears after the related items below.)
Mechanism of Reward Model Elimination
An alignment algorithm calculates the probability of a preferred response y_a over a dispreferred response y_b for a given prompt x using the following expression:
Sigmoid( β log( π_θ(y_a|x) / π_ref(y_a|x) ) - β log( π_θ(y_b|x) / π_ref(y_b|x) ) )
Based on a direct analysis of this expression, which of the following components is not explicitly required to compute this probability during the training process?
Your team must choose an alignment approach for an...
Your team is implementing preference-based alignme...
Your team is reviewing two proposed alignment impl...
In a preference-based LLM alignment project, your ...
Selecting and Justifying DPO vs. RLHF for Preference Alignment Under Operational Constraints
Explaining DPO’s Objective as Offline RL Without a Reward Model: A Pipeline and Math-Based Justification
Diagnosing a “Missing Reward Model” DPO Implementation and Its Offline Implications
Post-Deployment Alignment Update: Choosing Between DPO and RLHF Under Logging and Compute Constraints
Interpreting DPO Preference Probabilities and Pipeline Implications from Logged Policy Ratios
Choosing an Alignment Pipeline and Debugging a DPO Objective Under Compute and Data Constraints
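Derivation sketch referenced above (a reconstruction of the standard reasoning behind the formula, not text from the original cards): the quantity β log( π_θ(y|x) / π_ref(y|x) ) plays the role of a reward defined directly by the two policies, so substituting it into the Bradley–Terry preference model reproduces the expression, with the intractable normalizer cancelling in the difference; no separately trained reward model is needed.

\[
r_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)
\]
\[
\Pr(y_a \succ y_b \mid x)
  \;=\; \sigma\!\bigl(r_\theta(x, y_a) - r_\theta(x, y_b)\bigr)
  \;=\; \sigma\!\Bigl(\beta \log \frac{\pi_\theta(y_a \mid x)}{\pi_{\mathrm{ref}}(y_a \mid x)}
                 \;-\; \beta \log \frac{\pi_\theta(y_b \mid x)}{\pi_{\mathrm{ref}}(y_b \mid x)}\Bigr)
\]

The β log Z(x) term is identical for both responses to the same prompt, so it cancels, leaving only policy and reference log-probabilities in the training objective.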