Continuous Supervision from the RLHF Reward Model
A key advantage of using the reward model for scoring is that it provides a continuous supervision signal. Unlike a binary "good" or "bad" classification, the scalar score expresses a graded measure of quality, which is far more informative for training other models, such as the policy model in the subsequent RL stage.
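A minimal sketch can make this concrete (assumptions: PyTorch, mean pooling over the final hidden states, and a hypothetical hidden_size of 768; none of these details come from the course material). The head is trained on pairwise comparisons, yet at scoring time it emits a single real-valued reward per response rather than a binary label:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarRewardHead(nn.Module):
    """Projects a response's pooled hidden states to one scalar reward."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # A single output unit: the reward is a real number, not a 2-class label.
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_states: torch.Tensor) -> torch.Tensor:
        # Mean-pool over the sequence dimension, then project to one scalar per response.
        pooled = last_hidden_states.mean(dim=1)
        return self.score(pooled).squeeze(-1)

def preference_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Pairwise training objective: push the chosen response's score above the rejected one's.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with random "hidden states" for a batch of 4 response pairs.
head = ScalarRewardHead()
chosen = head(torch.randn(4, 16, 768))    # shape: (batch,)
rejected = head(torch.randn(4, 16, 768))
loss = preference_loss(chosen, rejected)
```

Note the two-stage pattern: the loss above only ever compares pairs during training, but the trained head can then score any single response on a continuous scale for the RL stage.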
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Continuous Supervision from the RLHF Reward Model
A language model is being aligned using feedback from human preferences. A separate model is first trained to distinguish between pairs of model-generated responses, learning to identify the better one in each pair. This model is then used to assign a single numerical value to each new response generated by the language model, guiding its optimization. What is the most significant advantage of this two-stage process?
During the reinforcement learning phase of model alignment, the reward model's primary function is to output a binary classification for each generated response, labeling it as either 'preferred' or 'not preferred'.
The Reward Model's Functional Shift
Policy Gradient Objective Function for RL Fine-Tuning
Learn After
A team is training a language model to generate helpful responses. They are considering two different feedback mechanisms to guide the training process:
- Mechanism A: A classifier that labels each generated response as either 'Good' or 'Bad'.
- Mechanism B: A scoring model that assigns each generated response a numerical score from 1 to 10, representing its degree of quality.
Which statement best analyzes the fundamental advantage of using Mechanism B over Mechanism A for refining the language model's performance?
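For intuition on the contrast the question draws, here is a hypothetical REINFORCE-style sketch (the reward values and log-probabilities are made-up numbers, and this is not the course's objective function): a graded score from Mechanism B scales each policy update in proportion to response quality, whereas Mechanism A's binary label would weight every 'Good' response identically.

```python
import torch

def surrogate_loss(log_prob: torch.Tensor, reward: float) -> torch.Tensor:
    # REINFORCE-style term: the reward multiplies the log-probability of the
    # sampled response, so the gradient magnitude tracks the reward.
    return -(reward * log_prob)

# Made-up log-probabilities for two sampled responses of different quality.
log_prob_ok = torch.tensor(-12.0, requires_grad=True)    # mediocre response
log_prob_great = torch.tensor(-9.0, requires_grad=True)  # clearly better response

# Mechanism B: graded scores (3.0 vs. 7.5) give proportionally different updates.
(surrogate_loss(log_prob_ok, 3.0) + surrogate_loss(log_prob_great, 7.5)).backward()
print(log_prob_ok.grad, log_prob_great.grad)  # -3.0 vs. -7.5

# Mechanism A: a binary 'Good' label (reward 1.0 for both) would collapse that distinction.
```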
Diagnosing a Language Model's Training Plateau
Evaluating a Change in a Model's Feedback Mechanism