Interpreting Reward Model Scores
Consider the preference model in which the probability that one response is preferred over another is the sigmoid of the difference between their reward scores. Using this model, evaluate the junior data scientist's conclusion. Is their reasoning correct? Explain why or why not, focusing on how the model interprets these scores.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A system models human preference between two generated responses, A and B, for a given prompt. It does this by first assigning a numerical reward score to each response, r(A) and r(B). The probability that response A is preferred over B is then calculated as Sigmoid(r(A) - r(B)). Based on this model, what happens to the predicted probability of preferring response A as the difference r(A) - r(B) becomes a very large positive number?
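A minimal sketch of this model may help make the limiting behaviour concrete. The function names `sigmoid` and `preference_probability` and the sample score differences are illustrative assumptions, not part of the original question.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(r_a: float, r_b: float) -> float:
    """P(A preferred over B) = Sigmoid(r(A) - r(B)).

    Hypothetical helper: only the difference of the two scores is used.
    """
    return sigmoid(r_a - r_b)


if __name__ == "__main__":
    # Illustrative differences r(A) - r(B); r(B) is fixed at 0 for convenience.
    for diff in (-10.0, -1.0, 0.0, 1.0, 5.0, 10.0, 50.0):
        p = preference_probability(diff, 0.0)
        print(f"r(A) - r(B) = {diff:6.1f}  ->  P(A preferred) = {p:.6f}")
```

As the difference grows, the printed probability climbs toward 1 without exceeding it, and a difference of 0 gives exactly 0.5. Because only the difference enters the formula, shifting both scores by the same constant leaves the probability unchanged.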
Interpreting Reward Model Scores
A preference model calculates the probability of response 'a' being preferred over response 'b' using their respective reward scores, r(a) and r(b). The initial formula is given as: P(a > b) = exp(r(a)) / (exp(r(a)) + exp(r(b))). Arrange the following algebraic steps in the correct order to simplify this expression into the form Sigmoid(r(a) - r(b)).
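For reference, one possible ordering of the simplification is sketched below. It assumes the standard definition Sigmoid(x) = 1 / (1 + exp(-x)); the exercise's own list of steps is not shown here, so this may not match it line for line.

```latex
\begin{align*}
P(a > b)
  &= \frac{\exp(r(a))}{\exp(r(a)) + \exp(r(b))} \\
  &= \frac{1}{1 + \exp(r(b)) / \exp(r(a))}
     && \text{divide numerator and denominator by } \exp(r(a)) \\
  &= \frac{1}{1 + \exp\bigl(r(b) - r(a)\bigr)} \\
  &= \frac{1}{1 + \exp\bigl(-(r(a) - r(b))\bigr)} \\
  &= \mathrm{Sigmoid}\bigl(r(a) - r(b)\bigr)
\end{align*}
```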