Formula

Squared Sum of Rewards Regularization

To make the supervision signal for training the reward model more robust, a regularization term based on the squared sum of rewards can be added to the pairwise comparison loss in RLHF. Because the pairwise loss depends only on the difference r(x, y_a) - r(x, y_b), the rewards are underdetermined: shifting both rewards by the same constant leaves the loss unchanged. Penalizing the squared sum of rewards anchors their scale around zero and thereby mitigates this underdetermination. The regularized loss function is formulated as:

\mathcal{L}_{\mathrm{reg}} = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b) \sim \mathcal{D}_r} \big[ \log \mathrm{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b | \mathbf{x}) \big] + \mathbb{E}_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b) \sim \mathcal{D}_r} \big[ \big( r(\mathbf{x},\mathbf{y}_a) + r(\mathbf{x},\mathbf{y}_b) \big)^2 \big].
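As a concrete illustration, below is a minimal PyTorch sketch of this loss. It assumes the standard Bradley-Terry parameterization Pr_φ(y_a ≻ y_b | x) = σ(r(x, y_a) - r(x, y_b)), which is not stated in the formula above; the function name and the weighting coefficient `reg_coef` are also illustrative assumptions, not part of the original formulation.

```python
import torch
import torch.nn.functional as F

def regularized_pairwise_loss(reward_a: torch.Tensor,
                              reward_b: torch.Tensor,
                              reg_coef: float = 1.0) -> torch.Tensor:
    """Pairwise comparison loss with squared-sum-of-rewards regularization.

    reward_a: rewards r(x, y_a) for the preferred responses, shape [batch].
    reward_b: rewards r(x, y_b) for the dispreferred responses, shape [batch].
    reg_coef: regularization weight (an assumption; the formula above uses
              an implicit weight of 1).
    """
    # -log Pr_phi(y_a > y_b | x), assuming the Bradley-Terry model:
    # Pr(y_a > y_b | x) = sigmoid(r(x, y_a) - r(x, y_b)).
    pairwise_loss = -F.logsigmoid(reward_a - reward_b).mean()

    # Squared sum of rewards: penalizes the common offset shared by the
    # two rewards, anchoring their scale around zero.
    reg_term = ((reward_a + reward_b) ** 2).mean()

    return pairwise_loss + reg_coef * reg_term
```

In training, `reward_a` and `reward_b` would come from one shared reward model scoring both responses of each preference pair; only the regularization term constrains their absolute values, since the pairwise term sees only their difference.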

