Reward Model Training as a Ranking Problem in RLHF
In RLHF, reward model training is framed as a ranking problem: the model learns to assign numerical scores to different outputs such that the ordering of those scores matches the preferences provided by human annotators. While several ranking formulations exist, the objective is typically achieved by minimizing a ranking loss function, which penalizes the model for incorrect orderings and encourages it to score preferred responses higher than less preferred ones.
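As a minimal sketch, the pairwise case is commonly handled with a Bradley-Terry style loss: the negative log-sigmoid of the score difference between the preferred and the rejected response. The function name below is illustrative, and real reward models compute scores with a neural network rather than taking them as inputs.

```python
import math

def pairwise_ranking_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise ranking loss:
    loss = -log(sigmoid(score_preferred - score_rejected)).
    Small when the preferred response is scored higher, large otherwise."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Correct ordering (preferred scored higher) -> small penalty.
low = pairwise_ranking_loss(3.2, 1.5)

# Incorrect ordering (rejected scored higher) -> large penalty.
high = pairwise_ranking_loss(1.5, 3.2)
```

Note that only the score *difference* matters: shifting both scores by a constant leaves the loss unchanged, which is why the absolute values a reward model outputs are not directly interpretable.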
References
Reference of Foundations of Large Language Models Course
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Policy Learning in RLHF
Dual Role of the RLHF Reward Model: Ranking-based Training for Scoring Application
Relation between Verifiers and RLHF Reward Models
General Loss Minimization Objective for Reward Model Training
Architecture and Function of the RLHF Reward Model
Reward Model Training as a Ranking Problem in RLHF
Underdetermined Model
Limitations of Outcome-Based Rewards for Entire Sequences
Training a Reward Model with Preference Data
Converting Listwise Rankings to Pairwise Preferences for Reward Model Training
Diagnosing Undesired Model Behavior
An AI team is training a reward model using a dataset where, for each prompt, human annotators have ranked several generated responses from best to worst. What is the fundamental task the reward model is being trained to perform based on this specific type of data?
An AI development team is training a model to act as a helpful assistant. They create a dataset where, for each user prompt, human evaluators are shown two different generated responses and asked to choose which one is better. The model is then trained on this dataset of pairwise preferences. After training, the team observes that the model consistently assigns higher scores to longer, more detailed responses, even when they are less helpful or contain irrelevant information. Which of the following is the most likely explanation for this emergent behavior?
Ranking LLM Outputs as an Alternative to Rating
Regularization in RLHF Reward Model Training
Complexity of Reward Model Training in RLHF
Evaluation Criteria for Pairwise Comparison in RLHF
Bradley-Terry Model
Listwise Ranking for Human Feedback in RLHF
Importance of Variability in Pairwise Preference Data
Evaluating a Feedback Collection Strategy
A development team is refining a language model's ability to generate summaries. For each source document, they have the model produce two different summaries. They then present these two summaries side-by-side to a human annotator and ask them to select the one that is of higher quality. Which statement best analyzes the primary strength of this specific approach for collecting human feedback?
Rationale for a Feedback Collection Method
Binary Encoding of Pairwise Feedback in RLHF
Learn After
Intuition of the Ranking Loss Function in RLHF
Reward Model Training via Ranking Loss Minimization
Reward Model Loss as Negative Log-Likelihood
Flexibility of Ranking Loss Functions in Reward Model Training
Learning-to-Rank Approaches for Human Preference Modeling
An AI team is training a system to learn from human preferences. They have a dataset where, for a given input x, humans consistently prefer response y_preferred over response y_rejected. After training, they test two different scoring models, Model A and Model B, on this pair. The models produce the following scores:
- Model A: score(x, y_preferred) = 3.2, score(x, y_rejected) = 1.5
- Model B: score(x, y_preferred) = -0.5, score(x, y_rejected) = -2.0
Based on these scores, which statement accurately evaluates the models' performance on this specific example?
A reward model is being trained to learn human preferences by minimizing a ranking loss function. This function penalizes the model when the score it assigns to a human-preferred response is not higher than the score for a less-preferred response. Given the same prompt, which of the following scoring outcomes for a preferred/less-preferred pair would incur a penalty from the loss function?
Evaluating Reward Model Score Outputs
Your team is running RLHF for a customer-facing LL...
You’re running an RLHF fine-tuning job for an inte...
You are reviewing an RLHF training run for an inte...
Diagnosing Instability in an RLHF + PPO Training Run
Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
Choosing and Justifying an RLHF Objective Under Competing Product Constraints
Designing an RLHF Training Blueprint for a Regulated Customer-Support LLM
Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses
Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions
Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO