Simplified Notation for Preference Probability Models
In the context of preference modeling, the probability notation Pr(·) is often a shorthand. A more complete representation would be Pr^ϕ(·), where the superscript ϕ denotes the parameters of the underlying model (e.g., the reward model). However, this superscript is frequently omitted to maintain notational clarity and reduce clutter.
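A minimal sketch of what the hidden superscript means, assuming the preference probability is the sigmoid of a reward difference, as in the Bradley-Terry-style items listed under Related. The linear reward, the feature vectors, and the names `reward`, `preference_prob`, `phi_1`, and `phi_2` are hypothetical and chosen only for illustration. The same output pair gets a different probability under each parameter setting, which is exactly the dependence that writing Pr(·) instead of Pr^ϕ(·) leaves implicit.

```python
import math
from typing import Sequence


def reward(phi: Sequence[float], features: Sequence[float]) -> float:
    # Hypothetical linear reward model: r_phi = dot(phi, features).
    return sum(w * f for w, f in zip(phi, features))


def preference_prob(
    phi: Sequence[float],
    feats_a: Sequence[float],
    feats_b: Sequence[float],
) -> float:
    # Pr^phi(y_a preferred over y_b) = sigmoid(r_phi(y_a) - r_phi(y_b)).
    diff = reward(phi, feats_a) - reward(phi, feats_b)
    return 1.0 / (1.0 + math.exp(-diff))


# Illustrative features for two candidate outputs (made-up numbers).
feats_a = [1.0, 0.5]
feats_b = [0.2, 0.9]

# Two different parameter settings for the same model family.
phi_1 = [2.0, 1.0]
phi_2 = [0.5, 0.5]

p1 = preference_prob(phi_1, feats_a, feats_b)
p2 = preference_prob(phi_2, feats_a, feats_b)

print(f"Pr^phi_1(a > b) = {p1:.4f}")
print(f"Pr^phi_2(a > b) = {p2:.4f}")
```

Both numbers would be written as Pr(a ≻ b) in the simplified notation; keeping `phi` as an explicit argument plays the role of the superscript.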
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Pair-wise Ranking Loss Formula for RLHF Reward Model
Reward Model Loss as Negative Log-Likelihood
Empirical Reward Model Loss Formula using Bradley-Terry Model
A system for evaluating generated text uses a scalar scoring function, r(input, output), to assign a numerical score to each potential output. For a given input, 'Output A' receives a score of 2.0, and 'Output B' receives a score of -0.2. The system models the probability that one output is preferred over another using the sigmoid of the difference between their scores. Based on this model, what is the approximate probability that 'Output A' is preferred over 'Output B'?
Impact of Score Transformation on Preference Probabilities
Derivation of the Bradley-Terry Preference Formula
Omission of Parameter Superscript in Probability Notation
A preference model calculates the probability that output Y_a is preferred over output Y_b by applying the sigmoid function to the difference in their scalar scores, score(Y_a) - score(Y_b). If the initial scores for Y_a and Y_b result in a preference probability greater than 50% but less than 100%, which of the following transformations to the scores is guaranteed to leave this probability unchanged?
Learn After
Interpreting Preference Model Notation
A research lab trains two different preference models, Model A and Model B, on the exact same dataset of human choices. When evaluating a specific input, they find that for a pair of outputs (Y_1, Y_2), Model A calculates the probability that Y_1 is preferred over Y_2 as 0.8, whereas Model B calculates this same probability as 0.6. Both results are reported using the notation Pr(Y_1 ≻ Y_2 | input). What is the most accurate explanation for this discrepancy?
Evaluating Notational Simplification in Preference Models