Formula

Policy Learning Loss Function in RLHF

The loss function for the policy learning stage in RLHF is defined as the negative expected utility of the model's outputs. The objective is to find the policy parameters $\theta$ that minimize this loss, which is equivalent to maximizing the expected utility. The formula is:

$$\mathcal{L}(\theta) = -\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\,\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}(\cdot|\mathbf{x})}\left[U(\mathbf{x}, \mathbf{y}; \theta)\right]$$

Where:

  • $\mathcal{D}$ denotes the input-only dataset.
  • $\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})$ signifies that the output $\mathbf{y}$ is sampled from the probability distribution defined by the language model's policy $\pi_{\theta}$, given the input $\mathbf{x}$.
  • $U(\mathbf{x}, \mathbf{y}; \theta)$ is a utility function that scores the quality of the output $\mathbf{y}$ for the input $\mathbf{x}$.
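Because $\mathbf{y}$ is sampled from the policy, the expectation cannot be differentiated through directly; in practice the gradient of this loss is estimated with score-function (REINFORCE-style) or PPO-type surrogates. The snippet below is a minimal numerical sketch of the objective above, not a production implementation: it assumes a toy one-token softmax policy over a small vocabulary and a hand-written stand-in for $U(\mathbf{x}, \mathbf{y}; \theta)$, estimates $\mathcal{L}(\theta)$ by Monte Carlo sampling, and uses the REINFORCE estimator for the gradient.

```python
# Minimal sketch (assumptions: a toy one-token "language model" with a softmax
# policy over a small vocabulary, and a hand-written utility standing in for
# U(x, y; theta)). Estimates L(theta) = -E_x E_{y~pi_theta}[U(x, y; theta)]
# by Monte Carlo sampling; the gradient uses the REINFORCE (score-function)
# estimator, since sampling itself is not differentiable.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 8          # toy output vocabulary size
N_INPUTS = 4       # toy "dataset" of input ids
theta = rng.normal(scale=0.1, size=(N_INPUTS, VOCAB))  # one logit row per input


def policy(theta, x):
    """pi_theta(. | x): softmax over the vocabulary for input x."""
    logits = theta[x]
    z = np.exp(logits - logits.max())
    return z / z.sum()


def utility(x, y):
    """Stand-in for U(x, y; theta): rewards outputs 'close' to the input id."""
    return -abs(int(x) - int(y))


def loss_and_grad(theta, inputs, samples_per_input=64):
    """Monte Carlo estimate of L(theta) and its REINFORCE gradient."""
    loss = 0.0
    grad = np.zeros_like(theta)
    scale = 1.0 / (len(inputs) * samples_per_input)
    for x in inputs:
        p = policy(theta, x)
        ys = rng.choice(VOCAB, size=samples_per_input, p=p)
        for y in ys:
            u = utility(x, y)
            loss += -u * scale
            # grad of log pi_theta(y|x) for a softmax policy: one_hot(y) - p
            score = -p.copy()
            score[y] += 1.0
            grad[x] += -u * score * scale
    return loss, grad


inputs = np.arange(N_INPUTS)
for step in range(200):
    loss, grad = loss_and_grad(theta, inputs)
    theta -= 0.5 * grad  # gradient descent on L(theta)
print("final estimated loss:", loss)
```

As the loss decreases, the toy policy shifts probability mass toward the outputs the utility function rates highly, which is the intended effect of minimizing $\mathcal{L}(\theta)$.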
Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences