Formula

Objective Function for Policy Learning in RLHF

The objective in the policy learning phase of Reinforcement Learning from Human Feedback (RLHF) is to find the optimal policy parameters, denoted $\tilde{\theta}$, that maximize the expected reward. The optimization starts from the parameters of a pre-trained model, $\hat{\theta}^{+}$, and seeks to maximize the reward assigned by a learned reward model, $R_{\hat{\omega}}$. The formal expression is:

$$\tilde{\theta} = \arg\max_{\hat{\theta}^{+}} \; \mathbb{E}_{(\mathbf{x},\, \mathbf{y}_{\hat{\theta}^{+}}) \sim \mathcal{D}_{\text{rlft}}} \left[ R_{\hat{\omega}}(\mathbf{x}, \mathbf{y}_{\hat{\theta}^{+}}) \right]$$

Here:

  • $\tilde{\theta}$ are the optimized policy parameters.
  • $\arg\max_{\hat{\theta}^{+}}$ indicates that we search for the parameters that maximize the objective, starting from the initial parameters $\hat{\theta}^{+}$.
  • $\mathbb{E}_{(\mathbf{x},\, \mathbf{y}_{\hat{\theta}^{+}}) \sim \mathcal{D}_{\text{rlft}}}$ denotes the expected value over the dataset $\mathcal{D}_{\text{rlft}}$. For each input $\mathbf{x}$ from the dataset, a response $\mathbf{y}_{\hat{\theta}^{+}}$ is generated by the current policy.
  • $R_{\hat{\omega}}(\mathbf{x}, \mathbf{y}_{\hat{\theta}^{+}})$ is the score assigned by the reward model (with parameters $\hat{\omega}$) to the generated response for the given input.
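To make the objective concrete, below is a minimal, self-contained PyTorch sketch that optimizes a toy policy against a stand-in reward model using the REINFORCE gradient estimator. Everything in it (`ToyPolicy`, `reward_model`, `prompts`, the vocabulary and response sizes) is hypothetical scaffolding, not from the source; it only illustrates the loop implied by the formula: sample $\mathbf{y}_{\hat{\theta}^{+}}$ from the current policy, score it with $R_{\hat{\omega}}$, and take a gradient step that increases the expected reward.

```python
# Minimal sketch of the RLHF policy-learning objective.
# All names and dimensions here are illustrative assumptions,
# not part of the source text.
import torch
import torch.nn as nn

VOCAB, RESP_LEN = 16, 4

class ToyPolicy(nn.Module):
    # Stands in for the pre-trained model with parameters theta-hat-plus.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.head = nn.Linear(32, VOCAB)

    def forward(self, x):
        # Mean-pool the context embeddings, return next-token logits.
        return self.head(self.emb(x).mean(dim=1))

def reward_model(x, y):
    # Stand-in for R_omega-hat(x, y): rewards responses whose tokens
    # are even. A real reward model is itself a learned network.
    return (y % 2 == 0).float().mean(dim=1)

policy = ToyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
prompts = torch.randint(0, VOCAB, (8, 3))  # plays the role of D_rlft

for step in range(100):
    x = prompts
    # Sample a response y_theta from the current policy, token by token.
    toks, logps = [], []
    ctx = x
    for _ in range(RESP_LEN):
        dist = torch.distributions.Categorical(logits=policy(ctx))
        tok = dist.sample()
        toks.append(tok)
        logps.append(dist.log_prob(tok))
        ctx = torch.cat([ctx, tok[:, None]], dim=1)
    y = torch.stack(toks, dim=1)
    r = reward_model(x, y)  # R_omega-hat(x, y_theta)
    # REINFORCE: ascend E[R] via the surrogate E[R * log pi(y|x)].
    loss = -(r.detach() * torch.stack(logps, dim=1).sum(dim=1)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that practical RLHF systems typically optimize this objective with PPO rather than plain REINFORCE, and add a KL penalty keeping the policy close to the initial model $\hat{\theta}^{+}$; the estimator above is used here only to keep the sketch short.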