1Cademy - Role and Definition of the Reference Model in RLHF

Learn Before

Policy Learning in RLHF

Concept

Role and Definition of the Reference Model in RLHF

In RLHF, the reference model, with parameters denoted by $\theta_{ref}$, is the baseline large language model (LLM) that provides the starting point for policy training. It is typically an earlier version of the LLM being trained, or a model fine-tuned without human feedback, such as a supervised fine-tuning (SFT) model. During policy training, the reference model has two functions: it is used to sample across the range of possible outputs, and it appears as a component of the loss calculation, where it helps regulate the policy updates.
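The loss-side role can be illustrated with a small sketch. The source does not give the exact form of the loss term, so the KL-style penalty below is an assumption: a commonly used choice that subtracts a scaled log-probability ratio between the policy and the reference model from the reward. The function names, the `beta` coefficient, and the example log-probabilities are all hypothetical.

```python
def sequence_log_prob(token_log_probs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_log_probs)


def kl_penalized_reward(reward, policy_token_log_probs, ref_token_log_probs, beta=0.1):
    """Return reward - beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    Assumed form of the regularizer (not taken from the source): the log-ratio
    is positive when the policy assigns the sampled output more probability
    than the reference model does, so drifting away from the reference
    model lowers the effective reward.
    """
    if len(policy_token_log_probs) != len(ref_token_log_probs):
        raise ValueError("both models must score the same token sequence")
    log_ratio = (sequence_log_prob(policy_token_log_probs)
                 - sequence_log_prob(ref_token_log_probs))
    return reward - beta * log_ratio


# Hypothetical numbers: the policy is far more confident than the reference.
shaped = kl_penalized_reward(
    reward=1.0,
    policy_token_log_probs=[-0.1, -0.2],
    ref_token_log_probs=[-1.0, -2.0],
    beta=0.1,
)
print(shaped)  # roughly 0.73: the raw reward of 1.0 is reduced by the penalty
```

When the policy and the reference model assign identical log-probabilities, the penalty is zero and the reward passes through unchanged. A larger `beta` keeps the policy closer to the reference model, at the cost of slower reward improvement.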

Updated 2026-05-02

References

Reference of Foundations of Large Language Models Course

Learn After

During the fine-tuning of a language model using a reward signal, a team observes that the model's outputs are becoming nonsensical, even though they receive high reward scores. The model is essentially 'gaming' the reward system. Which component in this training setup is specifically intended to mitigate this issue by penalizing the model for deviating too far from its initial, coherent language patterns?
Diagnosing Training Stagnation in a Reward-Based System
Evaluating Reference Model Selection in Reward-Based Training
