Formula

Squared Sum of Rewards Regularization

To make the supervision signal for training the reward model more robust, a regularization term based on the squared sum of rewards can be added to the pairwise comparison loss in RLHF. Because the pairwise loss depends only on the difference r(x, y_a) - r(x, y_b), the rewards are underdetermined: shifting both rewards by the same constant leaves the loss unchanged. Penalizing the squared sum of rewards anchors their scale around zero and thereby mitigates this underdetermination. The regularized loss function is formulated as:

\mathcal{L}_{\mathrm{reg}} = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b) \sim \mathcal{D}_r} \big[ \log \mathrm{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b | \mathbf{x}) \big] + \mathbb{E}_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b) \sim \mathcal{D}_r} \big[ \big( r(\mathbf{x},\mathbf{y}_a) + r(\mathbf{x},\mathbf{y}_b) \big)^2 \big].
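As a concrete illustration, below is a minimal PyTorch sketch of this loss. It assumes the standard Bradley-Terry parameterization Pr_φ(y_a ≻ y_b | x) = σ(r(x, y_a) - r(x, y_b)), which is not stated in the formula above; the function name and the weighting coefficient `reg_coef` are also illustrative assumptions, not part of the original formulation.

```python
import torch
import torch.nn.functional as F

def regularized_pairwise_loss(reward_a: torch.Tensor,
                              reward_b: torch.Tensor,
                              reg_coef: float = 1.0) -> torch.Tensor:
    """Pairwise comparison loss with squared-sum-of-rewards regularization.

    reward_a: rewards r(x, y_a) for the preferred responses, shape [batch].
    reward_b: rewards r(x, y_b) for the dispreferred responses, shape [batch].
    reg_coef: regularization weight (an assumption; the formula above uses
              an implicit weight of 1).
    """
    # -log Pr_phi(y_a > y_b | x), assuming the Bradley-Terry model:
    # Pr(y_a > y_b | x) = sigmoid(r(x, y_a) - r(x, y_b)).
    pairwise_loss = -F.logsigmoid(reward_a - reward_b).mean()

    # Squared sum of rewards: penalizes the common offset shared by the
    # two rewards, anchoring their scale around zero.
    reg_term = ((reward_a + reward_b) ** 2).mean()

    return pairwise_loss + reg_coef * reg_term
```

In training, `reward_a` and `reward_b` would come from one shared reward model scoring both responses of each preference pair; only the regularization term constrains their absolute values, since the pairwise term sees only their difference.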

