Multiple Choice

A reward model is being trained with an objective that maximizes L = -(s - r)^2 for each data point, where s is a fixed target score and r is the model's predicted reward. For a particular data point, the target score s is 5.0. The model currently predicts a reward r of 4.5. How would the value of L for this data point change if the model's prediction were updated to 5.5?

0

1
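A quick sanity check (not part of the original question): the objective can be evaluated directly at both predictions to see how L changes.

```python
s = 5.0  # fixed target score from the question

def L(r):
    # Objective for one data point: negative squared error.
    return -(s - r) ** 2

before = L(4.5)  # -(5.0 - 4.5)^2 = -0.25
after = L(5.5)   # -(5.0 - 5.5)^2 = -0.25
print(before, after, after - before)
```

Because the error is squared, an overshoot of 0.5 and an undershoot of 0.5 yield the same value of L, so the change is 0.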

Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science