Analyze the following training scenario and identify the most likely cause of the observed problem, explaining the underlying mechanism.

In RLHF, the reference model, with parameters denoted by $\theta_{ref}$, is the baseline large language model (LLM) from which policy training starts. It is typically an earlier version of the LLM being trained, or a model fine-tuned without human feedback, such as a supervised fine-tuned (SFT) model. During policy training, the reference model has two key functions: it is used to sample from the range of possible outputs, and it enters the loss calculation, where it helps regularize the policy updates.
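As a rough illustration of the second function, one common way the reference model enters the objective is a KL-style penalty subtracted from the reward. The sketch below assumes that formulation; the function name `kl_penalized_reward`, the coefficient `beta`, and the toy log-probabilities are hypothetical and are not taken from any particular library.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Return (shaped_reward, kl_estimate) for one sampled response.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the SAME
    sampled tokens under the policy and under the frozen reference model.
    beta: hypothetical penalty coefficient (an assumption, tuned in practice).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")

    # Sample-based estimate of the divergence from the reference model:
    # sum over tokens of log pi_theta(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

    # The further the policy drifts from the reference, the more reward it loses.
    shaped_reward = reward - beta * kl_estimate
    return shaped_reward, kl_estimate


# Toy values (made up for illustration): the policy assigns much higher
# probability to its own sampled tokens than the reference model does.
policy_lp = [-0.2, -0.1, -0.3]
ref_lp = [-1.0, -0.9, -1.1]
shaped, kl = kl_penalized_reward(2.0, policy_lp, ref_lp, beta=0.5)

# A policy identical to the reference incurs no penalty.
same, kl_same = kl_penalized_reward(2.0, ref_lp, ref_lp, beta=0.5)
```

With a larger `beta`, the policy is held closer to $\theta_{ref}$; with `beta` near zero, the reward signal dominates and the policy is free to drift.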

Role and Definition of the Reference Model in RLHF

During the fine-tuning of a language model using a reward signal, a team observes that the model's outputs are becoming nonsensical, even though they receive high reward scores. The model is essentially 'gaming' the reward system. Which component in this training setup is specifically intended to mitigate this issue by penalizing the model for deviating too far from its initial, coherent language patterns?

Diagnosing Training Stagnation in a Reward-Based System

In a reward-based training process for a language model, a fixed 'reference model' is used to regularize the policy updates, preventing the main model from deviating too drastically from a known, stable distribution. Evaluate the trade-offs involved in choosing this reference model. Specifically, compare the potential outcomes of using the initial, pre-trained base model versus using a model that has already undergone some initial instruction-based fine-tuning as the reference.
