Optimal Parameters Formula in RL Fine-Tuning

In reinforcement learning (RL) fine-tuning, the optimal parameters, denoted $\tilde{\theta}$, are obtained by fine-tuning the pre-trained parameters $\hat{\theta}$. This optimization maximizes an expected reward over the RL fine-tuning dataset $\mathcal{D}_{\mathrm{rlft}}$, using the formula:

$$\tilde{\theta} = \arg\max_{\hat{\theta}^+} \; \mathbb{E}_{(\mathbf{x},\, \mathbf{y}_{\hat{\theta}^+}) \sim \mathcal{D}_{\mathrm{rlft}}} \left[ R_{\hat{\omega}}(\mathbf{x}, \mathbf{y}_{\hat{\theta}^+}) \right]$$

In this equation, $\hat{\theta}^+$ denotes the parameters of the active policy being optimized (initialized from $\hat{\theta}$), while $R_{\hat{\omega}}$ is the reward model, with parameters $\hat{\omega}$, that scores the paired sample of the input sequence $\mathbf{x}$ and the model-generated output $\mathbf{y}_{\hat{\theta}^+}$.
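A minimal sketch of how this objective can be estimated in practice: the expectation is approximated by Monte Carlo sampling of $(\mathbf{x}, \mathbf{y}_{\hat{\theta}^+})$ pairs. The `reward` and `sample_output` functions below are hypothetical toy stand-ins (not from the source) for the reward model $R_{\hat{\omega}}$ and the policy $\pi_{\hat{\theta}^+}$.

```python
import random

# Hypothetical toy stand-in for the reward model R_{omega_hat}:
# scores an (x, y) pair; here it simply rewards outputs that echo the prompt.
def reward(x, y):
    return 1.0 if y.startswith(x) else 0.0

# Hypothetical toy stand-in for the policy pi_{theta_hat^+}:
# samples an output y given an input x (80% "good" responses).
def sample_output(x, rng):
    return x + " world" if rng.random() < 0.8 else "unrelated"

# Monte Carlo estimate of the RL fine-tuning objective
# E_{(x, y) ~ D_rlft}[ R(x, y) ], with y drawn from the current policy.
def estimate_objective(prompts, n_samples, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.choice(prompts)       # draw an input sequence from the dataset
        y = sample_output(x, rng)     # draw an output from the policy
        total += reward(x, y)         # score the pair with the reward model
    return total / n_samples

avg = estimate_objective(["hello"], n_samples=1000)
```

In actual RL fine-tuning this estimate is not computed in isolation: the policy parameters $\hat{\theta}^+$ are updated (e.g. with a policy-gradient method) so that the estimated expected reward increases.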


Updated 2026-04-20


Tags

Foundations of Large Language Models

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences