Based on the standard training process for language models fine-tuned with human feedback, what specific component is designed to prevent the kind of extreme behavioral change described in the case study below, and how does it function to counteract the model's tendency to over-optimize for the flawed reward signal?


In Reinforcement Learning from Human Feedback (RLHF), the reference policy, denoted as $\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})$, is a fixed policy used as a baseline during the optimization of the active policy $\pi_{\theta}$. It is typically a frozen copy of the supervised fine-tuned (SFT) model, taken before the RLHF stage begins. Its role is to keep the active policy from drifting too far from the language style and safety behavior of the original model. This is enforced by a penalty term (e.g., a KL-divergence term) that measures how far the active policy has moved from the reference policy.

Reference Policy in RLHF

The goal of the policy training stage in Reinforcement Learning from Human Feedback (RLHF) is to find the optimal policy parameters $$\tilde{\theta}$$ that maximize expected reward without deviating too far from a reference policy. The training objective evaluates the quality of an output $$\mathbf{y}$$ given an input $$\mathbf{x}$$ using a reward model $$r(\mathbf{x},\mathbf{y})$$. The objective minimizes the negative reward (loss) and includes a penalty for policy divergence:

$$\tilde{\theta} = \arg \min_{\theta} \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \big[ \underbrace{-r(\mathbf{x}, \mathbf{y})}_{\text{loss}} + \beta \underbrace{(\log \pi_{\theta}(\mathbf{y}|\mathbf{x}) - \log \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}))}_{\text{penalty}} \big]$$

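Here, the penalty regularizes the current policy $$\pi_{\theta}$$ against the reference policy $$\pi_{\theta_{\mathrm{ref}}}$$. The coefficient $$\beta$$ sets the strength of this regularization: a larger $$\beta$$ keeps the policy closer to the reference, while $$\beta = 0$$ reduces the objective to pure reward maximization.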
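As a rough illustration of the objective above, the following sketch computes the bracketed term for one sampled response and averages it over a batch as a Monte Carlo estimate of the expectation. The function names (`penalized_loss`, `batch_objective`) and the assumption that rewards and sequence log-probabilities are already available as plain floats are hypothetical, chosen only for this example; a real training loop would obtain them from the reward model and the two policies.

```python
from typing import Sequence


def penalized_loss(reward: float, logp_policy: float,
                   logp_ref: float, beta: float) -> float:
    """Per-sample term: -r(x, y) + beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    penalty = logp_policy - logp_ref  # log-ratio between current and reference policy
    return -reward + beta * penalty


def batch_objective(rewards: Sequence[float],
                    logps_policy: Sequence[float],
                    logps_ref: Sequence[float],
                    beta: float) -> float:
    """Average the per-sample terms over sampled (x, y) pairs (hypothetical helper)."""
    terms = [
        penalized_loss(r, lp, lr, beta)
        for r, lp, lr in zip(rewards, logps_policy, logps_ref)
    ]
    return sum(terms) / len(terms)


# Toy numbers, not real model outputs: the policy assigns the response a higher
# log-probability than the reference does, so the penalty is positive.
loss = penalized_loss(reward=1.0, logp_policy=-2.0, logp_ref=-3.0, beta=0.5)
```

With these toy values the penalty is 1.0, so the loss is -1.0 + 0.5 * 1.0 = -0.5. Setting `beta` to 0 would leave only the negative reward, which is the unregularized case the penalty is meant to avoid.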

RLHF Policy Optimization Objective

The penalty term in PPO for language models quantifies the divergence between the current policy $\text{Pr}_{\theta}$ and a reference policy $\text{Pr}_{\theta_{\text{ref}}}$. It is defined as the difference in the log-probabilities of generating the response $\mathbf{y}$ given the prompt $\mathbf{x}$: $$ \text{Penalty} = \log \text{Pr}_{\theta}(\mathbf{y}|\mathbf{x}) - \log \text{Pr}_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) $$ For autoregressive language models, this can be decomposed exactly into a sum over the tokens in the sequence: $$ \text{Penalty} = \sum_{t=1}^{T} \log \text{Pr}_{\theta}(y_t|\mathbf{x}, \mathbf{y}_{<t}) - \sum_{t=1}^{T} \log \text{Pr}_{\theta_{\text{ref}}}(y_t|\mathbf{x}, \mathbf{y}_{<t}) $$
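The token-level decomposition can be sketched in a few lines. This example assumes the per-token conditional probabilities of the sampled response under each policy are already given as lists of floats; the helper names `sequence_log_prob` and `sequence_penalty` are hypothetical and not taken from any particular library.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Sum of log Pr(y_t | x, y_<t) over the tokens of one response."""
    return sum(math.log(p) for p in token_probs)


def sequence_penalty(policy_token_probs: Sequence[float],
                     ref_token_probs: Sequence[float]) -> float:
    """Penalty = log Pr_theta(y|x) - log Pr_theta_ref(y|x), computed token by token."""
    if len(policy_token_probs) != len(ref_token_probs):
        raise ValueError("both policies must score the same token sequence")
    return sequence_log_prob(policy_token_probs) - sequence_log_prob(ref_token_probs)


# Toy two-token response: the current policy gives the first token probability 0.5,
# the reference gives it 0.25; both agree on the second token.
penalty = sequence_penalty([0.5, 0.5], [0.25, 0.5])
```

Here the sequence probabilities are 0.25 under the current policy and 0.125 under the reference, so the penalty equals log 2. Summing log-probabilities rather than multiplying probabilities is the usual choice because products of many small numbers underflow quickly.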

Policy Divergence Penalty for Language Models

A penalty term is incorporated into the RLHF objective function to regularize the policy and prevent it from deviating excessively from a reference policy. This penalty is formulated as the difference between the log probabilities of a sequence under the current policy (parameters $\theta$) and the reference policy (parameters $\theta_{\text{ref}}$), summed over all tokens in the sequence. The formula is: $$ \text{Penalty} = \log \text{Pr}_{\theta}(\mathbf{y}|\mathbf{x}) - \log \text{Pr}_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) = \sum_{t=1}^{T} \log \text{Pr}_{\theta}(y_t|\mathbf{x}, \mathbf{y}_{<t}) - \sum_{t=1}^{T} \log \text{Pr}_{\theta_{\text{ref}}}(y_t|\mathbf{x}, \mathbf{y}_{<t}) $$
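The link between this log-ratio and the KL divergence can be made explicit with a short worked step. Since the objective samples $\mathbf{y}$ from the current policy, taking the expectation of the penalty over those samples gives, by the definition of KL divergence:

$$ \mathbb{E}_{\mathbf{y} \sim \text{Pr}_{\theta}(\cdot|\mathbf{x})} \big[ \log \text{Pr}_{\theta}(\mathbf{y}|\mathbf{x}) - \log \text{Pr}_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \big] = \sum_{\mathbf{y}} \text{Pr}_{\theta}(\mathbf{y}|\mathbf{x}) \log \frac{\text{Pr}_{\theta}(\mathbf{y}|\mathbf{x})}{\text{Pr}_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})} = D_{\mathrm{KL}}\big( \text{Pr}_{\theta}(\cdot|\mathbf{x}) \,\|\, \text{Pr}_{\theta_{\text{ref}}}(\cdot|\mathbf{x}) \big) \geq 0 $$

The penalty for a single sampled response is therefore a one-sample estimate of this KL divergence. It can be negative for an individual response, but its expectation is non-negative and is zero only when the two policies agree.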

KL-Divergence Penalty in RLHF Policy Optimization

An AI development team is fine-tuning a language model using a reinforcement learning process guided by a reward model. They observe that the model's outputs, while receiving high scores from the reward model, are becoming stylistically unnatural and deviating significantly from the helpful tone established during its initial supervised training. Which of the following adjustments to the training process is most specifically designed to counteract this behavioral drift?

Diagnosing and Mitigating Reward Hacking

Imagine a team is training a large language model using a reinforcement learning process. They have a reward model that accurately scores outputs for helpfulness. However, they decide to optimize their active policy to maximize this reward directly, without comparing it to a fixed, initial version of the model. Analyze the potential negative consequences of this approach. Describe at least two distinct undesirable behaviors the final model might exhibit and explain why these behaviors could arise in the absence of this comparison.
