Formula

Hinge Loss Formula for Segment-Based Reward Model Training

The hinge loss is a max-margin loss function used for training binary classification models. In the context of segment-based reward modeling, it is formulated as:

$$\mathcal{L}_{\mathrm{hinge}} = \max\bigl(0,\ 1 - r(\mathbf{x}, \mathbf{y}, \bar{\mathbf{y}}_k) \cdot \hat{r}\bigr)$$

In this formula, $r(\mathbf{x}, \mathbf{y}, \bar{\mathbf{y}}_k)$ is the score the reward model assigns to the segment $\bar{\mathbf{y}}_k$. The term $\hat{r}$ is the ground-truth label for the segment, typically encoded as $+1$ for one class (e.g., 'ethical') and $-1$ for the other (e.g., 'unethical'). The loss is zero when the score has the correct sign and a margin of at least $1$, that is, when $r(\mathbf{x}, \mathbf{y}, \bar{\mathbf{y}}_k) \cdot \hat{r} \geq 1$. Otherwise, the loss equals the shortfall $1 - r(\mathbf{x}, \mathbf{y}, \bar{\mathbf{y}}_k) \cdot \hat{r}$, so it grows linearly as the score moves further to the wrong side of the margin.
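The behaviour described above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function name `hinge_loss`, the example scores, and the averaging over segments are assumptions made for illustration, and the scores stand in for the outputs of a reward model.

```python
def hinge_loss(score: float, label: int) -> float:
    """Hinge loss max(0, 1 - score * label) for a single segment.

    `score` plays the role of the reward model output r(x, y, y_k);
    `label` is the ground-truth label r_hat, assumed to be +1 or -1.
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    return max(0.0, 1.0 - score * label)


# Hypothetical (score, label) pairs for three segments.
examples = [
    (2.0, 1),    # correct sign, margin >= 1: no loss
    (0.5, 1),    # correct sign, but inside the margin
    (0.5, -1),   # wrong sign
]

losses = [hinge_loss(score, label) for score, label in examples]

# Averaging over segments is one common reduction; it is an assumption here.
mean_loss = sum(losses) / len(losses)

print(losses)
print(mean_loss)
```

The first segment contributes nothing because its score clears the margin, the second is penalized even though its sign is correct, and the third receives the largest loss because its score has the wrong sign.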

Updated 2026-05-03

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences