Formula

RLHF Policy Optimization Objective

The goal of the policy training stage in Reinforcement Learning from Human Feedback (RLHF) is to find the policy parameters $\tilde{\theta}$ that maximize expected reward without deviating too far from a reference policy. The training objective evaluates the quality of an output $\mathbf{y}$ given an input $\mathbf{x}$ using a reward model $r(\mathbf{x}, \mathbf{y})$. The objective minimizes the negative reward (the loss term) and includes a penalty for policy divergence:

$$
\tilde{\theta} = \arg\min_{\theta} \, \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \Big[ \underbrace{-r(\mathbf{x}, \mathbf{y})}_{\text{loss}} + \beta \underbrace{\big(\log \pi_{\theta}(\mathbf{y}|\mathbf{x}) - \log \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})\big)}_{\text{penalty}} \Big]
$$

Here, the penalty regularizes the current policy $\pi_{\theta}$ against the reference policy $\pi_{\theta_{\mathrm{ref}}}$ via the coefficient $\beta$: larger values of $\beta$ keep the trained policy closer to the reference. In expectation over $\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})$, the penalty term equals the KL divergence $D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot|\mathbf{x}) \,\|\, \pi_{\theta_{\mathrm{ref}}}(\cdot|\mathbf{x})\big)$.
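
To make the objective concrete, here is a minimal PyTorch sketch that evaluates the bracketed quantity for a batch of sampled outputs. The function name `rlhf_loss`, the flat tensor shapes, and the default `beta` are illustrative assumptions, not from the source; in practice the sequence log-probabilities come from summing per-token log-probabilities under the policy and the frozen reference model.

```python
import torch

def rlhf_loss(reward: torch.Tensor,
              logp_policy: torch.Tensor,
              logp_ref: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """Monte Carlo estimate of the RLHF objective for a batch of samples.

    reward:      r(x, y) from the reward model, shape (batch,)
    logp_policy: log pi_theta(y | x) under the current policy, shape (batch,)
    logp_ref:    log pi_ref(y | x) under the frozen reference policy, shape (batch,)
    beta:        penalty coefficient (illustrative default)
    """
    penalty = logp_policy - logp_ref   # log-ratio; its expectation is the KL term
    loss = -reward + beta * penalty    # negative reward plus divergence penalty
    return loss.mean()                 # average over the sampled batch

# Dummy usage with a batch of two sampled responses:
reward = torch.tensor([1.2, -0.3])
logp_policy = torch.tensor([-12.5, -20.1])
logp_ref = torch.tensor([-13.0, -19.8])
print(rlhf_loss(reward, logp_policy, logp_ref))
```

Note that the outer expectation is itself taken over samples drawn from $\pi_{\theta}$, so this value is typically optimized with a policy-gradient method (e.g., PPO) rather than by backpropagating through the sampling step; the sketch only shows the per-sample objective.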


