Learn Before
Dual Role of the RLHF Reward Model: Ranking-based Training for Scoring Application
The reward model in RLHF plays a dual role. During training, it is optimized with a pairwise ranking objective that compares preferred and dispreferred responses, which makes it sensitive to subtle quality differences between outputs. In its application phase, however, it assigns an independent, continuous scalar score to each prompt-response pair. This shift from relative comparison (ranking) to absolute evaluation (scoring) is what supplies the nuanced, continuous reward signal needed to guide the RL optimization of the LLM.
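A minimal sketch of this dual role, assuming a pooled feature vector stands in for a real pretrained LM backbone; the names RewardModel and pairwise_ranking_loss and the 768-dimensional dummy inputs are illustrative, not from the source:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scalar reward head on top of a placeholder encoder."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Stand-in for a pretrained LM backbone (assumption for illustration).
        self.encoder = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh())
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, hidden_dim) pooled representation of a prompt-response pair.
        # Returns one continuous scalar reward per pair.
        return self.value_head(self.encoder(features)).squeeze(-1)

def pairwise_ranking_loss(rm: RewardModel, chosen: torch.Tensor,
                          rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective: push r(chosen) above r(rejected)."""
    margin = rm(chosen) - rm(rejected)
    # -log sigmoid(margin) is minimized when chosen outscores rejected.
    return -F.logsigmoid(margin).mean()

# Training phase: relative comparison on preference pairs (dummy data).
rm = RewardModel()
chosen = torch.randn(4, 768)    # features of preferred responses
rejected = torch.randn(4, 768)  # features of dispreferred responses
loss = pairwise_ranking_loss(rm, chosen, rejected)
loss.backward()

# Application phase: absolute scoring of a single new response.
with torch.no_grad():
    score = rm(torch.randn(1, 768))  # independent scalar reward for the RL step
```

Note that the same forward pass serves both phases: the pairwise loss only constrains score differences during training, while the RL stage consumes each score on its own as an absolute reward.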
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Policy Learning in RLHF
Relation between Verifiers and RLHF Reward Models
General Loss Minimization Objective for Reward Model Training
Architecture and Function of the RLHF Reward Model
Reward Model Training as a Ranking Problem in RLHF
Underdetermined Model
Limitations of Outcome-Based Rewards for Entire Sequences
Training a Reward Model with Preference Data
Converting Listwise Rankings to Pairwise Preferences for Reward Model Training
Diagnosing Undesired Model Behavior
An AI team is training a reward model using a dataset where, for each prompt, human annotators have ranked several generated responses from best to worst. What is the fundamental task the reward model is being trained to perform based on this specific type of data?
An AI development team is training a model to act as a helpful assistant. They create a dataset where, for each user prompt, human evaluators are shown two different generated responses and asked to choose which one is better. The model is then trained on this dataset of pairwise preferences. After training, the team observes that the model consistently assigns higher scores to longer, more detailed responses, even when they are less helpful or contain irrelevant information. Which of the following is the most likely explanation for this emergent behavior?
Ranking LLM Outputs as an Alternative to Rating
Regularization in RLHF Reward Model Training
Complexity of Reward Model Training in RLHF
Learn After
Continuous Supervision from the RLHF Reward Model
A language model is being aligned using feedback from human preferences. A separate model is first trained to distinguish between pairs of model-generated responses, learning to identify the better one in each pair. This model is then used to assign a single numerical value to each new response generated by the language model, guiding its optimization. What is the most significant advantage of this two-stage process?
During the reinforcement learning phase of model alignment, the reward model's primary function is to output a binary classification for each generated response, labeling it as either 'preferred' or 'not preferred'.
The Reward Model's Functional Shift
Policy Gradient Objective Function for RL Fine-Tuning