Formula

Empirical Pair-wise Ranking Loss for RLHF Reward Model

The reward model in RLHF is trained by minimizing an empirical pair-wise ranking loss, which is calculated as an average over the human preference dataset. This loss function encourages the model to assign a higher score to a preferred response y_a than to a less preferred one y_b for the same input prompt. The formula, which is based on the Bradley-Terry model, is:

\mathcal{L}(\phi) = - \frac{1}{|\mathcal{D}_r|} \sum_{(x,y_a,y_b) \in \mathcal{D}_r} \log \sigma\left(r_\phi(x,y_a) - r_\phi(x,y_b)\right)

Here, \mathcal{D}_r is the preference dataset, r_\phi is the reward model with parameters \phi, and \sigma is the sigmoid function.
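Below is a minimal PyTorch sketch of this loss, not taken from the source: the function name pairwise_ranking_loss and the toy score tensors are illustrative assumptions, and it presumes the reward model has already produced scalar scores r_\phi(x, y_a) and r_\phi(x, y_b) for each pair in a batch.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(r_a: torch.Tensor, r_b: torch.Tensor) -> torch.Tensor:
    """Empirical pair-wise ranking loss over a batch of preference pairs.

    r_a: scores r_phi(x, y_a) for the preferred responses, shape (batch,)
    r_b: scores r_phi(x, y_b) for the less preferred responses, shape (batch,)
    Returns the mean of -log sigmoid(r_a - r_b), matching the formula above.
    """
    # -log(sigmoid(z)) equals softplus(-z), which is numerically stabler
    # than applying log to a sigmoid directly.
    return F.softplus(-(r_a - r_b)).mean()

# Toy usage with hypothetical reward scores for three preference pairs.
r_a = torch.tensor([1.2, 0.4, 2.0])  # scores for preferred responses y_a
r_b = torch.tensor([0.3, 0.9, 1.1])  # scores for rejected responses y_b
print(pairwise_ranking_loss(r_a, r_b))  # lower when r_a consistently exceeds r_b
```

Minimizing this loss pushes the score margin r_\phi(x, y_a) - r_\phi(x, y_b) to be large and positive, which under the Bradley-Terry model corresponds to a high predicted probability that y_a is preferred over y_b.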
