1Cademy - A reward model is being trained using a loss function calculated as the negative log of a sigmoid function applied to the difference in scores between a preferred response ($y_a$) and a rejected response ($y_b$). For a single training instance, the model outputs a score of $r(y_a) = 2.0$ for the preferred response and $r(y_b) = 3.0$ for the rejected response. How will this specific outcome influence the model's parameter update for this step?

Learn Before

Empirical Reward Model Loss Formula using Bradley-Terry Model

Multiple Choice

A reward model is being trained using a loss function calculated as the negative log of a sigmoid function applied to the difference in scores between a preferred response ($y_a$) and a rejected response ($y_b$). For a single training instance, the model outputs a score of $r(y_a) = 2.0$ for the preferred response and $r(y_b) = 3.0$ for the rejected response. How will this specific outcome influence the model's parameter update for this step?
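
To reason about this instance, it can help to evaluate the loss $-\log \sigma(r(y_a) - r(y_b))$ and its derivatives with respect to the two scores directly. The following is a minimal sketch, not a training implementation: the function names `sigmoid` and `pairwise_loss_and_grads` are hypothetical, and the scores are treated as free scalars rather than outputs of a parameterized model.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_loss_and_grads(r_a: float, r_b: float):
    """Loss -log(sigmoid(r_a - r_b)) and its derivatives w.r.t. each score.

    r_a is the score of the preferred response, r_b of the rejected one.
    (Hypothetical helper for illustration only.)
    """
    diff = r_a - r_b
    p = sigmoid(diff)          # modeled probability that y_a beats y_b
    loss = -math.log(p)
    grad_r_a = -(1.0 - p)      # d(loss)/d(r_a)
    grad_r_b = 1.0 - p         # d(loss)/d(r_b)
    return loss, grad_r_a, grad_r_b


# Scores from the question: preferred response scored below the rejected one.
loss, grad_r_a, grad_r_b = pairwise_loss_and_grads(2.0, 3.0)
print(f"loss       = {loss:.4f}")
print(f"dL/dr(y_a) = {grad_r_a:.4f}")
print(f"dL/dr(y_b) = {grad_r_b:.4f}")
```

Here the score difference is $-1.0$, so $\sigma(-1.0) \approx 0.269$ and the loss is about $1.313$. Under gradient descent, the negative derivative on $r(y_a)$ pushes that score up and the positive derivative on $r(y_b)$ pushes that score down. The magnitude (about $0.731$) is larger than it would be if the ranking were already correct: a difference of $+1.0$ would give about $0.269$.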

Updated 2025-09-28
