Learn Before
A reward model is being trained using a pair-wise ranking loss function. For a given prompt x, the preference dataset contains a pair of responses: a preferred response y_pref and a rejected response y_rej. Initially, the model assigns the following scores: R(x, y_pref) = 2.0 and R(x, y_rej) = 3.0. Based on the objective of the loss function, what is the most likely change to these scores after a single optimization step on this data point?
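To reason about the answer, it helps to see one gradient step on the standard pairwise ranking loss, L = -log(sigmoid(R(x, y_pref) - R(x, y_rej))). The sketch below treats the two scores as directly trainable scalars (a simplification; in practice they come from a neural reward model) and uses a hypothetical learning rate of 0.1:

```python
import math

# Pairwise ranking (Bradley-Terry style) loss on one preference pair:
#   L = -log(sigmoid(R(x, y_pref) - R(x, y_rej)))
# Scores are treated as directly trainable parameters for illustration.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

r_pref, r_rej = 2.0, 3.0   # initial scores from the question
lr = 0.1                   # hypothetical learning rate

margin = r_pref - r_rej    # -1.0: the model currently ranks the pair the wrong way

# Analytic gradients of L with respect to each score
grad_pref = -(1.0 - sigmoid(margin))   # negative, so a descent step raises r_pref
grad_rej = 1.0 - sigmoid(margin)       # positive, so a descent step lowers r_rej

r_pref -= lr * grad_pref
r_rej -= lr * grad_rej

print(round(r_pref, 4), round(r_rej, 4))  # 2.0731 2.9269
```

After the step, R(x, y_pref) increases toward (and eventually past) R(x, y_rej), which decreases: the loss pushes the score gap in favor of the preferred response.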
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of a Weighted Ranking Loss
Handling Labeler Disagreement in Reward Modeling