Concept

Dataset Composition for RL Fine-Tuning in RLHF

The dataset used for the reinforcement learning fine-tuning phase, often denoted $\mathcal{D}_{\mathrm{rlft}}$, is generated dynamically. Each training sample is a pair $(\mathbf{x}, \mathbf{y}_{\hat{\theta}^+})$. The input sequence $\mathbf{x}$ is drawn from a pre-compiled dataset of inputs. The output $\mathbf{y}_{\hat{\theta}^+}$, however, is not a fixed pre-existing label; rather, it is sampled from the probability distribution $\Pr_{\hat{\theta}^+}(\mathbf{y} \mid \mathbf{x})$ defined by the current policy of the language model, which is initialized with pre-trained parameters $\hat{\theta}$ and iteratively fine-tuned toward optimal parameters $\tilde{\theta}$.
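The dynamic construction of $\mathcal{D}_{\mathrm{rlft}}$ can be sketched as follows. This is a minimal illustration, not the book's implementation: `toy_policy`, `sample_from_policy`, and `build_rlft_batch` are hypothetical names, and the toy categorical policy stands in for a real language model's autoregressive distribution $\Pr_{\hat{\theta}^+}(\mathbf{y} \mid \mathbf{x})$.

```python
import random

def sample_from_policy(x, policy, rng):
    """Sample an output sequence y token-by-token from the current policy.

    `policy(x, y_prefix)` returns a dict mapping next tokens to
    probabilities, i.e. Pr(y_t | x, y_<t) under the current parameters.
    """
    y = []
    token = None
    while token != "<eos>":
        probs = policy(x, tuple(y))
        tokens, weights = zip(*probs.items())
        token = rng.choices(tokens, weights=weights, k=1)[0]
        y.append(token)
    return y

def build_rlft_batch(inputs, policy, rng):
    """Pair each pre-compiled input x with a freshly sampled output y.

    Unlike supervised fine-tuning data, the y's are regenerated from the
    current policy, so the dataset changes as the parameters are updated.
    """
    return [(x, sample_from_policy(x, policy, rng)) for x in inputs]

# Hypothetical toy policy that always terminates within two tokens.
def toy_policy(x, y_prefix):
    if len(y_prefix) >= 2:
        return {"<eos>": 1.0}
    return {"a": 0.5, "b": 0.3, "<eos>": 0.2}

rng = random.Random(0)
batch = build_rlft_batch(["prompt1", "prompt2"], toy_policy, rng)
```

Each call to `build_rlft_batch` yields a fresh batch of $(\mathbf{x}, \mathbf{y}_{\hat{\theta}^+})$ pairs; in a real RLHF loop, a reward model would then score these sampled outputs before the policy update.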

Updated 2026-05-02

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.4 Alignment - Foundations of Large Language Models