Short Answer

Analyzing Reward Model Loss Behavior

A reward model is trained to prefer response $y_a$ over $y_b$ for a given prompt, using the loss function $\mathcal{L} = -\log \sigma(r(y_a) - r(y_b))$, where $r$ is the score function and $\sigma$ is the sigmoid function. Suppose that for a particular training instance, the model incorrectly assigns a much higher score to the dispreferred response, so that the score difference $r(y_a) - r(y_b)$ is a large negative number (e.g., $-5$). Describe the resulting value of the loss for this instance (e.g., close to zero, large and positive) and explain your reasoning by tracing the calculation through the sigmoid and negative log functions.
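The loss in the question can be traced numerically. A minimal sketch (the function name and the concrete scores are illustrative, chosen so the score difference is $-5$):

```python
import math

def rm_pairwise_loss(score_a: float, score_b: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r(y_a) - r(y_b))."""
    delta = score_a - score_b
    # -log(sigmoid(delta)) simplifies to log(1 + exp(-delta));
    # log1p keeps the computation accurate for small arguments.
    return math.log1p(math.exp(-delta))

# Dispreferred response scored much higher than the preferred one:
loss = rm_pairwise_loss(1.0, 6.0)   # delta = -5
print(round(loss, 4))  # large positive loss, roughly 5.0067
```

With $\Delta = -5$, $\sigma(-5) \approx 0.0067$, and $-\log(0.0067) \approx 5.01$, so a confidently wrong instance yields a large positive loss, while a correct one ($\Delta = +5$) would yield a loss near zero.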


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy
