Learn Before
RLHF Policy Optimization Objective
The goal of the policy training stage in Reinforcement Learning from Human Feedback (RLHF) is to find the optimal policy parameters $\theta^*$ that maximize expected reward without deviating too far from a reference policy. The training objective evaluates the quality of an output $y$ given an input $x$ using a reward model $R(x, y)$. The objective minimizes the negative reward (loss) and includes a penalty for policy divergence:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\left[ R(x, y) - \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]$$

Here, the penalty $\log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ regularizes the current policy $\pi_{\theta}$ against the reference policy $\pi_{\text{ref}}$ using a coefficient $\beta$.
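As a concrete illustration (not from the course material), the sketch below computes this loss for a batch of sampled outputs in PyTorch, using the sequence-level log-probability ratio as the divergence penalty. The function name `rlhf_policy_loss`, the tensor shapes, and the toy values are illustrative assumptions:

```python
import torch

def rlhf_policy_loss(reward, logp_policy, logp_ref, beta=0.1):
    """Per-sequence RLHF loss: negative reward plus a KL-style penalty.

    reward:      reward-model scores R(x, y), shape (batch,)
    logp_policy: log pi_theta(y | x) under the current policy, shape (batch,)
    logp_ref:    log pi_ref(y | x) under the frozen reference model, shape (batch,)
    beta:        coefficient controlling the strength of the divergence penalty
    """
    # log-ratio estimate of the divergence between pi_theta and pi_ref
    # on the sampled outputs
    kl_penalty = logp_policy - logp_ref
    # minimizing -(reward - beta * penalty) maximizes the penalized reward
    return -(reward - beta * kl_penalty).mean()

# Toy usage with random values standing in for model outputs (hypothetical).
reward = torch.tensor([1.2, 0.4, 0.9])
logp_policy = torch.tensor([-32.0, -41.5, -28.3], requires_grad=True)
logp_ref = torch.tensor([-33.1, -40.8, -30.0])

loss = rlhf_policy_loss(reward, logp_policy, logp_ref)
loss.backward()  # gradients flow only into the current policy's log-probs
print(loss.item())
```

In practice the penalty is often accumulated per token rather than per sequence, and the reference model's log-probabilities are computed with gradients disabled so that only the current policy is updated.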

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Objective Function for Policy Learning in RLHF
Use of Proximal Policy Optimization (PPO) in RLHF
Application of A2C in RLHF for LLM Alignment
Role and Definition of the Reference Model in RLHF
Joint Optimization of Policy and Value Functions in RLHF
RLHF Policy Optimization Objective
Reference Policy in RLHF
RLHF Policy Optimization as Loss Minimization
A language model is being fine-tuned using an iterative feedback process. In each step, the model generates a response to a prompt. A separate, pre-trained scoring model then assigns a numerical score to this response based on its quality. What is the most direct and immediate use of this numerical score within a single step of this training loop?
Arrange the following events into the correct chronological order as they would occur within a single iterative step of the policy learning phase for a language model.
Diagnosing a Training Failure in an Iterative Fine-Tuning Process
Direct Preference Optimization (DPO)
Policy Divergence Penalty for Language Models
KL-Divergence Penalty in RLHF Policy Optimization
An AI development team is fine-tuning a language model using a reinforcement learning process guided by a reward model. They observe that the model's outputs, while receiving high scores from the reward model, are becoming stylistically unnatural and deviating significantly from the helpful tone established during its initial supervised training. Which of the following adjustments to the training process is most specifically designed to counteract this behavioral drift?
Diagnosing and Mitigating Reward Hacking
Consequences of Omitting a Reference Policy in RLHF
Learn After
PPO Objective for LLM Training
Derivation of the KL Divergence Objective for Policy Optimization
During the policy optimization stage of training a large language model, an engineer observes that the model's outputs are coherent and safe, but they show very little improvement over the initial supervised fine-tuned version and consistently receive mediocre scores from the reward model. Which of the following is the most likely cause of this issue, based on the policy optimization objective function that balances maximizing rewards with a penalty for policy divergence?
Analyzing the Trade-off in Policy Optimization
Analyzing a Modified Policy Optimization Objective