Learn Before
A reward model is being trained using a pair-wise ranking loss function. For a given prompt x, the preference dataset contains a pair of responses: a preferred response y_pref and a rejected response y_rej. Initially, the model assigns the following scores: R(x, y_pref) = 2.0 and R(x, y_rej) = 3.0. Based on the objective of the loss function, what is the most likely change to these scores after a single optimization step on this data point?
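To reason about the answer, it helps to see one gradient step on the standard pairwise ranking loss, L = -log(sigmoid(R(x, y_pref) - R(x, y_rej))). The sketch below treats the two scores as directly trainable scalars (a simplification; in practice they come from a neural reward model) and uses a hypothetical learning rate of 0.1:

```python
import math

# Pairwise ranking (Bradley-Terry style) loss on one preference pair:
#   L = -log(sigmoid(R(x, y_pref) - R(x, y_rej)))
# Scores are treated as directly trainable parameters for illustration.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

r_pref, r_rej = 2.0, 3.0   # initial scores from the question
lr = 0.1                   # hypothetical learning rate

margin = r_pref - r_rej    # -1.0: the model currently ranks the pair the wrong way

# Analytic gradients of L with respect to each score
grad_pref = -(1.0 - sigmoid(margin))   # negative, so a descent step raises r_pref
grad_rej = 1.0 - sigmoid(margin)       # positive, so a descent step lowers r_rej

r_pref -= lr * grad_pref
r_rej -= lr * grad_rej

print(round(r_pref, 4), round(r_rej, 4))  # 2.0731 2.9269
```

After the step, R(x, y_pref) increases toward (and eventually past) R(x, y_rej), which decreases: the loss pushes the score gap in favor of the preferred response.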
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of a Weighted Ranking Loss
Handling Labeler Disagreement in Reward Modeling