1Cademy - Diagnosing Reward Model Training Issues

Learn Before

Pair-wise Ranking Loss Formula for RLHF Reward Model

Case Study

Diagnosing Reward Model Training Issues

Based on the components of the pair-wise ranking loss formula, explain why this specific behavior (assigning similar scores to both responses) results in a high loss value and prevents the model from learning the desired preferences.
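As a starting point for the analysis, here is a minimal sketch of the loss described in this case study, assuming the standard form: negative log of a sigmoid applied to the difference between the preferred and non-preferred reward scores. The function names and the example scores are hypothetical and chosen only for illustration; they do not come from the course material.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pairwise_ranking_loss(r_pref: float, r_non_pref: float) -> float:
    """Return -log(sigmoid(r_pref - r_non_pref)) for one response pair.

    Uses the identity -log(sigmoid(d)) = log(1 + exp(-d)), split by sign
    of d so that exp() never receives a large positive argument.
    """
    diff = r_pref - r_non_pref
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def loss_slope_wrt_diff(r_pref: float, r_non_pref: float) -> float:
    """Derivative of the loss with respect to the score difference.

    d/dd [-log(sigmoid(d))] = -(1 - sigmoid(d)) = -sigmoid(-d)
    """
    return -sigmoid(-(r_pref - r_non_pref))


if __name__ == "__main__":
    # Hypothetical score pairs: (preferred, non-preferred)
    examples = {
        "similar scores": (1.0, 1.0),      # loss = log(2), about 0.693
        "correct, wide gap": (3.0, -1.0),  # loss about 0.018
        "wrong order": (-1.0, 3.0),        # loss about 4.018
    }
    for name, (rp, rn) in examples.items():
        print(f"{name}: loss={pairwise_ranking_loss(rp, rn):.3f}, "
              f"slope={loss_slope_wrt_diff(rp, rn):.3f}")
```

When the two scores are equal, the difference is 0, the sigmoid returns 0.5, and the loss is log 2 (about 0.693) regardless of how large the individual scores are. Only the gap between the scores enters the formula, so the loss falls toward 0 only as the preferred response is scored increasingly higher than the non-preferred one.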

Updated 2025-10-04

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.4 Alignment - Foundations of Large Language Models

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

Empirical Formulation of Pair-wise Ranking Loss
Empirical Pair-wise Ranking Loss for RLHF Reward Model
Regularized Pairwise Loss Function for Reward Model Training
A reward model is being trained to prefer one machine-generated text response over another for a given input. The training process aims to minimize a loss function calculated as the negative logarithm of a sigmoid applied to the difference between the reward scores of the preferred ($R_{\text{pref}}$) and non-preferred ($R_{\text{non-pref}}$) responses. Given the following reward scores assigned by the model to a single pair of responses, which scenario contributes the least to the total loss, indicating th
Analyzing Reward Model Performance via Loss Function