Optimal Parameters Formula in RL Fine-Tuning

In reinforcement learning (RL) fine-tuning, the optimal parameters, denoted $\tilde{\theta}$, are obtained by fine-tuning the pre-trained parameters $\hat{\theta}$. This optimization maximizes an expected reward over the RL fine-tuning dataset $\mathcal{D}_{\mathrm{rlft}}$, using the formula:

$$\tilde{\theta} = \arg\max_{\hat{\theta}^+} \; \mathbb{E}_{(\mathbf{x},\, \mathbf{y}_{\hat{\theta}^+}) \sim \mathcal{D}_{\mathrm{rlft}}} \left[ R_{\hat{\omega}}(\mathbf{x}, \mathbf{y}_{\hat{\theta}^+}) \right]$$

In this equation, $\hat{\theta}^+$ denotes the parameters of the active policy being optimized (initialized from $\hat{\theta}$), while $R_{\hat{\omega}}$ is the reward model, with parameters $\hat{\omega}$, that scores the paired sample of the input sequence $\mathbf{x}$ and the model-generated output $\mathbf{y}_{\hat{\theta}^+}$.
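A minimal sketch of how this objective can be estimated in practice: the expectation is approximated by Monte Carlo sampling of $(\mathbf{x}, \mathbf{y}_{\hat{\theta}^+})$ pairs. The `reward` and `sample_output` functions below are hypothetical toy stand-ins (not from the source) for the reward model $R_{\hat{\omega}}$ and the policy $\pi_{\hat{\theta}^+}$.

```python
import random

# Hypothetical toy stand-in for the reward model R_{omega_hat}:
# scores an (x, y) pair; here it simply rewards outputs that echo the prompt.
def reward(x, y):
    return 1.0 if y.startswith(x) else 0.0

# Hypothetical toy stand-in for the policy pi_{theta_hat^+}:
# samples an output y given an input x (80% "good" responses).
def sample_output(x, rng):
    return x + " world" if rng.random() < 0.8 else "unrelated"

# Monte Carlo estimate of the RL fine-tuning objective
# E_{(x, y) ~ D_rlft}[ R(x, y) ], with y drawn from the current policy.
def estimate_objective(prompts, n_samples, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.choice(prompts)       # draw an input sequence from the dataset
        y = sample_output(x, rng)     # draw an output from the policy
        total += reward(x, y)         # score the pair with the reward model
    return total / n_samples

avg = estimate_objective(["hello"], n_samples=1000)
```

In actual RL fine-tuning this estimate is not computed in isolation: the policy parameters $\hat{\theta}^+$ are updated (e.g. with a policy-gradient method) so that the estimated expected reward increases.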


Updated 2026-04-20


Tags

Foundations of Large Language Models

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences