Case Study

Analyzing Reward Model Performance with Hinge Loss

You are training a reward model to classify text segments as either 'preferred' (label = +1) or 'dispreferred' (label = -1). The model's performance is measured using the loss function: Loss = max(0, 1 - (model_score * label)). You are evaluating the model on two 'dispreferred' segments:

  • Segment A receives a model_score of 0.5.
  • Segment B receives a model_score of -0.2.

Calculate the loss for both segments. Based on these loss values, on which segment is the model performing worse, and why does the loss function penalize it more?

0

1

Updated 2025-10-04

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science