Learn Before
Modeling Pairwise Preference Probability with a Reward Function
The probability that one response is preferred over another, given an input, is modeled using a learned reward function r(x, y). This is achieved by applying the sigmoid function to the difference between the reward scores of the two responses, as specified by the Bradley-Terry model. The formula is: P(y_a ≻ y_b | x) = σ(r(x, y_a) − r(x, y_b)), where σ(z) = 1 / (1 + e^(−z)). This is a foundational component for training reward models in RLHF.
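The formula above can be sketched directly in code. This is a minimal illustration, not a training implementation; the reward scores below are hypothetical numbers standing in for the outputs of a learned reward model.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: sigma(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def preference_probability(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B,
    given reward scores r(x, y_a) = r_a and r(x, y_b) = r_b."""
    return sigmoid(r_a - r_b)


# Hypothetical reward scores for two responses to the same prompt.
p = preference_probability(2.0, 0.5)
# A's score exceeds B's, so the preference probability is above 0.5;
# equal scores would give exactly 0.5.
```

Because only the difference of scores enters the sigmoid, the model predicts a 50% preference whenever the two responses score equally.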

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Bradley-Terry Model for Pairwise Preference Probability
Ranking Chatbot Responses
A user provides the prompt, denoted as 'x', 'Translate the phrase "hello world" into French.', to a language model. The model generates two responses: Response A ('y_A'), which is 'Bonjour le monde', and Response B ('y_B'), which is 'Salut monde'. A human evaluator indicates that Response A is a better translation than Response B. Which of the following expressions correctly represents the probability of this specific preference, given the user's prompt?
Modeling Pairwise Preference Probability with a Reward Function
Interpreting Preference Probability Notation
Learn After
Listwise Loss Formula from Accumulated Pairwise Comparisons
Empirical Reward Model Loss Formula
Empirical Formulation of Pair-wise Ranking Loss
A system learns a function, r(input, response), that assigns a numerical score indicating the quality of a response for a given input. The probability that response Y_a is preferred over response Y_b is then calculated using the formula: Probability = Sigmoid(r(input, Y_a) - r(input, Y_b)), where Sigmoid(z) = 1 / (1 + e^-z). Given the following scenarios for a single input, which one presents a logical inconsistency between the assigned scores and the resulting preference probability?
Preference Probability Calculation
Invariance of Preference Probability
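The consistency and invariance properties referenced above follow directly from the formula: a higher score must yield a probability above 0.5, and shifting both scores by the same constant leaves the probability unchanged, since only the difference enters the sigmoid. A short sketch, using hypothetical reward values:

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def pref_prob(r_a: float, r_b: float) -> float:
    """P(Y_a preferred over Y_b) = Sigmoid(r_a - r_b)."""
    return sigmoid(r_a - r_b)


# Consistency: if r_a > r_b, the preference probability must exceed 0.5.
# A scenario claiming r_a > r_b but probability < 0.5 is inconsistent.
assert pref_prob(1.2, -0.3) > 0.5

# Invariance: adding the same constant c to both scores does not change
# the probability, because (r_a + c) - (r_b + c) = r_a - r_b.
r_a, r_b, c = 1.2, -0.3, 5.0
assert abs(pref_prob(r_a, r_b) - pref_prob(r_a + c, r_b + c)) < 1e-12
```

This invariance means the reward function is only identified up to an additive constant per input; only score differences are meaningful for preference prediction.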