Dataset Composition for RL Fine-Tuning in RLHF
The dataset used for the reinforcement learning fine-tuning phase, often denoted as $\mathcal{D}_{rl}$, is generated dynamically. Each training sample is a pair $(x, y)$. The input sequence $x$ is drawn from a pre-compiled dataset of inputs. The output $y$, however, is not a fixed pre-existing label; rather, it is sampled from the probability distribution $\mathrm{Pr}_{\theta}(\cdot \mid x)$ defined by the current policy of the language model, which is initialized with pre-trained parameters $\theta_0$ and iteratively fine-tuned to reach the optimal parameters $\hat{\theta}$.
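The objective is typically written as $\hat{\theta} = \arg\max_{\theta} \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \mathrm{Pr}_{\theta}(\cdot \mid x)}\left[ r(x, y) \right]$, where $r(x, y)$ is the reward assigned to the sampled pair. To make the "generated dynamically" point concrete, here is a minimal Python sketch of this sampling scheme, using a toy tabular policy in place of a real language model; `sample_from_policy`, `build_rl_batch`, and the toy distributions are illustrative assumptions, not part of the course material.

```python
import random

def sample_from_policy(policy, prompt):
    # Toy stand-in for the language model's policy:
    # a lookup table mapping prompt -> {response: probability}.
    responses, weights = zip(*policy[prompt].items())
    return random.choices(responses, weights=weights, k=1)[0]

def build_rl_batch(prompts, policy, batch_size=2):
    # Each pair (x, y) is created on the fly: x comes from the fixed,
    # pre-compiled prompt set, while y is sampled from the *current*
    # policy, so the dataset changes as the parameters are updated.
    batch = []
    for _ in range(batch_size):
        x = random.choice(prompts)
        y = sample_from_policy(policy, x)
        batch.append((x, y))
    return batch

# Toy example (hypothetical prompts and response distributions).
prompts = ["Summarize this article.", "Explain RLHF in one sentence."]
policy = {
    "Summarize this article.": {
        "A short summary.": 0.7,
        "A long summary.": 0.3,
    },
    "Explain RLHF in one sentence.": {
        "RLHF fine-tunes a model using human feedback.": 0.9,
        "I don't know.": 0.1,
    },
}
print(build_rl_batch(prompts, policy))
```

Because $y$ is drawn from the policy being trained, re-running `build_rl_batch` after a parameter update yields a different effective dataset. This is exactly the dynamic property contrasted with static, fixed-label SFT datasets in the related cards below.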
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Dataset Composition for RL Fine-Tuning in RLHF
A machine learning engineer is creating a dataset to fine-tune a language model to act as a helpful assistant. The goal is to teach the model to follow instructions and provide complete, high-quality answers. Which of the following examples represents the most effective input-output pair for this supervised fine-tuning task?
Structuring a Sample from Input and Output Segments
Deconstructing an SFT Training Sample
Constructing an SFT Training Pair for Text Summarization
Annotation Simplicity in RLHF: Recognition over Demonstration
Exploration Advantage of RLHF
Dataset Composition for RL Fine-Tuning in RLHF
A development team aims to fine-tune a language model to be 'helpful and harmless'—qualities that are nuanced and difficult to exemplify perfectly. They consider two strategies:
- Supervised Approach: Have human experts write ideal, 'gold-standard' responses to a wide range of prompts for the model to imitate.
- Preference-Based Approach: Have the model generate multiple responses to each prompt, and then have human experts rank these responses from best to worst.
What is the primary reason that the preference-based approach is often more effective for aligning a model with such complex human values?
Improving a Sarcasm-Detecting AI
Limitations of Static Datasets in Model Fine-Tuning
Learn After
Formulating the Loss Function for Policy Learning in RLHF
A team is refining a language model using a method where, for each training step, a prompt is selected and the model itself generates a response. This prompt-response pair is then used as part of the input for that training step's update calculation. Based on this description, what is the most accurate analysis of the function of the model-generated response in this specific training phase?
Policy Learning in RLHF
Comparing Data Sourcing Strategies
Contrasting Data Sourcing Methods in Model Training
Optimal Parameters Formula in RL Fine-Tuning