1Cademy - A system models human preference between two generated responses, A and B, for a given prompt. It does this by first assigning a numerical reward score to each response, r(A) and r(B). The probability that response A is preferred over B is then calculated as Sigmoid(r(A) - r(B)). Based on this model, what happens to the predicted probability of preferring response A as the difference r(A) - r(B) becomes a very large positive number?
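The limiting behavior asked about here can be checked numerically. The following is a minimal sketch, not part of the original question: the function names `sigmoid` and `preference_probability` are illustrative assumptions, and the reward values are made up for the demonstration. It evaluates Sigmoid(r(A) - r(B)) for increasingly large positive differences.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is below 1, so this branch is safe too.
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Predicted probability that A is preferred over B: Sigmoid(r(A) - r(B))."""
    return sigmoid(reward_a - reward_b)


if __name__ == "__main__":
    # Hypothetical reward gaps, chosen only to illustrate the trend.
    for diff in (0.0, 1.0, 5.0, 20.0, 100.0):
        p = preference_probability(diff, 0.0)
        print(f"r(A) - r(B) = {diff:6.1f} -> P(A preferred) = {p:.10f}")
```

With equal rewards the probability is 0.5, and as r(A) - r(B) grows the value rises monotonically toward 1 without exceeding it. In floating-point arithmetic a very large difference may round to exactly 1.0, although mathematically the sigmoid only approaches 1 asymptotically.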

Learn Before

Derivation of the Bradley-Terry Preference Formula

Multiple Choice


Updated 2025-09-26
