Objective Function for Policy Learning in RLHF
The objective in the policy learning phase of Reinforcement Learning from Human Feedback (RLHF) is to find the optimal policy parameters, denoted as $\theta^*$, that maximize the expected reward. The optimization starts from the parameters of a pre-trained model, $\theta_0$, and seeks to maximize the reward assigned by a learned reward model, $r_\phi$. The formal expression is:

$$\theta^* = \arg\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\big[\, r_{\phi}(x, y) \,\big]$$

Here:
- $\theta^*$ are the optimized policy parameters.
- $\arg\max_{\theta}$ indicates that we are searching for the parameters $\theta$ that maximize the objective, with the search initialized at the pre-trained parameters $\theta_0$.
- $\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}$ represents the expected value over the dataset $\mathcal{D}$: for each input $x$ drawn from the dataset, a response $y$ is generated by the current policy $\pi_{\theta}$.
- $r_{\phi}(x, y)$ is the score assigned by the reward model (with parameters $\phi$) to the generated response $y$ for the given input $x$.
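
To make the objective concrete, below is a minimal sketch of how it can be optimized with a REINFORCE-style gradient estimate. The toy policy (a logit table over a handful of candidate responses), the random "reward model" score table, and the prompt/response counts are all illustrative assumptions; a real RLHF pipeline would use a language-model policy, a learned reward model, and an algorithm such as PPO.

```python
# Minimal sketch (illustrative assumptions, not the course's implementation):
#   theta* = argmax_theta  E_{x ~ D, y ~ pi_theta(.|x)} [ r_phi(x, y) ]
# estimated with REINFORCE on a toy problem.

import torch

torch.manual_seed(0)

NUM_PROMPTS = 4      # toy dataset D: 4 prompts
NUM_RESPONSES = 5    # toy action space: 5 candidate responses per prompt

# Toy policy pi_theta: one logit per (prompt, response) pair.
theta = torch.zeros(NUM_PROMPTS, NUM_RESPONSES, requires_grad=True)

# Frozen "reward model" r_phi: here just a fixed random score table.
r_phi = torch.randn(NUM_PROMPTS, NUM_RESPONSES)

optimizer = torch.optim.Adam([theta], lr=0.1)

for step in range(200):
    # Sample x ~ D uniformly and y ~ pi_theta(. | x).
    x = torch.randint(0, NUM_PROMPTS, (32,))
    probs = torch.softmax(theta[x], dim=-1)
    y = torch.multinomial(probs, num_samples=1).squeeze(-1)

    # Score each sampled response with the reward model.
    reward = r_phi[x, y]

    # REINFORCE estimator: grad E[r] ~= E[ r * grad log pi_theta(y|x) ],
    # so minimizing the negative reward-weighted log-probability
    # performs gradient ascent on the expected reward.
    log_prob = torch.log(probs[torch.arange(len(x)), y])
    loss = -(reward * log_prob).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained policy should concentrate on the highest-reward response
# for each prompt.
print(torch.softmax(theta, dim=-1).argmax(dim=-1))
print(r_phi.argmax(dim=-1))
```

The point the sketch illustrates is that the reward $r_{\phi}(x, y)$ weights the log-probability of each sampled response, so gradient ascent pushes probability mass toward high-reward responses; note that this bare objective has no KL-divergence penalty toward the reference model, which is what the follow-up cards on reward hacking and the reference policy address.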

References
Reference of Foundations of Large Language Models Course
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Objective Function for Policy Learning in RLHF
Use of Proximal Policy Optimization (PPO) in RLHF
Application of A2C in RLHF for LLM Alignment
Role and Definition of the Reference Model in RLHF
Joint Optimization of Policy and Value Functions in RLHF
RLHF Policy Optimization Objective
Reference Policy in RLHF
RLHF Policy Optimization as Loss Minimization
A language model is being fine-tuned using an iterative feedback process. In each step, the model generates a response to a prompt. A separate, pre-trained scoring model then assigns a numerical score to this response based on its quality. What is the most direct and immediate use of this numerical score within a single step of this training loop?
Arrange the following events into the correct chronological order as they would occur within a single iterative step of the policy learning phase for a language model.
Diagnosing a Training Failure in an Iterative Fine-Tuning Process
Direct Preference Optimization (DPO)
Objective Function for Policy Learning in RLHF
A language model generates a response that is evaluated by breaking it into four distinct segments. A reward function assigns a score to each segment based on its quality. The scores for the segments are: Segment 1: +1.2, Segment 2: -0.5, Segment 3: +0.8, and Segment 4: -0.2. If the total reward for the entire response is calculated by summing the rewards of its individual segments, what is the total reward?
A language model generates a three-paragraph summary of a research paper. The first paragraph accurately introduces the paper's objective. The second paragraph correctly describes the methodology but contains a significant factual error about the main finding. The third paragraph draws a logical, but ultimately incorrect, conclusion based on the error in the second paragraph. If the total quality score for the summary is calculated as the sum of scores from each paragraph (segment), which segment is most likely to receive the lowest score?
Debugging a Recipe-Generating Language Model
Learn After
KL-Divergence Penalty in RLHF Policy Optimization
A team is fine-tuning a language model where the only goal is to adjust the model's parameters to maximize the average score from a fixed reward model. After many training iterations, the team observes that while the policy consistently achieves high reward scores, the generated text is becoming repetitive and stylistically unnatural. What is the most likely reason for this outcome, based on the optimization objective?
Diagnosing Undesirable Model Behavior
Match each mathematical component from the policy learning objective function with its conceptual role in the training process.