Formula

Policy Learning Loss Function in RLHF

The loss function for the policy learning stage in RLHF is defined as the negative expected utility of the model's outputs. The objective is to find the policy parameters $\theta$ that minimize this loss, which is equivalent to maximizing the expected utility. The formula is:

$$\mathcal{L}(\theta) = -\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\,\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}(\cdot|\mathbf{x})}\left[U(\mathbf{x}, \mathbf{y}; \theta)\right]$$

Where:

  • $\mathcal{D}$ denotes the input-only dataset.
  • $\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})$ signifies that the output $\mathbf{y}$ is sampled from the probability distribution defined by the language model's policy $\pi_{\theta}$, given the input $\mathbf{x}$.
  • $U(\mathbf{x}, \mathbf{y}; \theta)$ is a utility function that scores the quality of the output $\mathbf{y}$ for the input $\mathbf{x}$.
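Because $\mathbf{y}$ is sampled from the policy, the expectation cannot be differentiated through directly; in practice the gradient of this loss is estimated with score-function (REINFORCE-style) or PPO-type surrogates. The snippet below is a minimal numerical sketch of the objective above, not a production implementation: it assumes a toy one-token softmax policy over a small vocabulary and a hand-written stand-in for $U(\mathbf{x}, \mathbf{y}; \theta)$, estimates $\mathcal{L}(\theta)$ by Monte Carlo sampling, and uses the REINFORCE estimator for the gradient.

```python
# Minimal sketch (assumptions: a toy one-token "language model" with a softmax
# policy over a small vocabulary, and a hand-written utility standing in for
# U(x, y; theta)). Estimates L(theta) = -E_x E_{y~pi_theta}[U(x, y; theta)]
# by Monte Carlo sampling; the gradient uses the REINFORCE (score-function)
# estimator, since sampling itself is not differentiable.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 8          # toy output vocabulary size
N_INPUTS = 4       # toy "dataset" of input ids
theta = rng.normal(scale=0.1, size=(N_INPUTS, VOCAB))  # one logit row per input


def policy(theta, x):
    """pi_theta(. | x): softmax over the vocabulary for input x."""
    logits = theta[x]
    z = np.exp(logits - logits.max())
    return z / z.sum()


def utility(x, y):
    """Stand-in for U(x, y; theta): rewards outputs 'close' to the input id."""
    return -abs(int(x) - int(y))


def loss_and_grad(theta, inputs, samples_per_input=64):
    """Monte Carlo estimate of L(theta) and its REINFORCE gradient."""
    loss = 0.0
    grad = np.zeros_like(theta)
    scale = 1.0 / (len(inputs) * samples_per_input)
    for x in inputs:
        p = policy(theta, x)
        ys = rng.choice(VOCAB, size=samples_per_input, p=p)
        for y in ys:
            u = utility(x, y)
            loss += -u * scale
            # grad of log pi_theta(y|x) for a softmax policy: one_hot(y) - p
            score = -p.copy()
            score[y] += 1.0
            grad[x] += -u * score * scale
    return loss, grad


inputs = np.arange(N_INPUTS)
for step in range(200):
    loss, grad = loss_and_grad(theta, inputs)
    theta -= 0.5 * grad  # gradient descent on L(theta)
print("final estimated loss:", loss)
```

As the loss decreases, the toy policy shifts probability mass toward the outputs the utility function rates highly, which is the intended effect of minimizing $\mathcal{L}(\theta)$.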
Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences