Short Answer

Analyzing Reward Model Loss Behavior

A reward model is trained to prefer response $y_a$ over $y_b$ for a given prompt, using the loss function $\mathcal{L} = -\log \sigma(r(y_a) - r(y_b))$, where $r$ is the score function and $\sigma$ is the sigmoid function. Suppose that for a particular training instance, the model incorrectly assigns a much higher score to the dispreferred response, so that the score difference $r(y_a) - r(y_b)$ is a large negative number (e.g., $-5$). Describe the resulting value of the loss for this instance (e.g., close to zero, large and positive) and explain your reasoning by tracing the calculation through the sigmoid and negative log functions.
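The loss in the question can be traced numerically. A minimal sketch (the function name and the concrete scores are illustrative, chosen so the score difference is $-5$):

```python
import math

def rm_pairwise_loss(score_a: float, score_b: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r(y_a) - r(y_b))."""
    delta = score_a - score_b
    # -log(sigmoid(delta)) simplifies to log(1 + exp(-delta));
    # log1p keeps the computation accurate for small arguments.
    return math.log1p(math.exp(-delta))

# Dispreferred response scored much higher than the preferred one:
loss = rm_pairwise_loss(1.0, 6.0)   # delta = -5
print(round(loss, 4))  # large positive loss, roughly 5.0067
```

With $\Delta = -5$, $\sigma(-5) \approx 0.0067$, and $-\log(0.0067) \approx 5.01$, so a confidently wrong instance yields a large positive loss, while a correct one ($\Delta = +5$) would yield a loss near zero.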


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy
