
Three-Stage Training Process of RLHF

The practical application of Reinforcement Learning from Human Feedback (RLHF) proceeds in three main stages. First, the models are initialized: the reward and value models typically start from a pre-trained Large Language Model (LLM), while the reference model and the target model (the policy) are initialized from an instruction fine-tuned model. From this point on, the reference model is frozen and not updated further. Second, human preference data is collected and used to train the reward model. Third, the value model and the policy are trained jointly using the trained reward model: at each position in an output sequence, the value model is updated by minimizing the Mean Squared Error (MSE) between its value prediction and the observed return, while the policy is updated by minimizing the Proximal Policy Optimization (PPO) loss.
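
As a rough illustration, the sketch below maps the three stages onto a toy PyTorch script: small random modules stand in for the initialized models (stage 1), a pairwise preference loss stands in for reward model training on human comparison data (stage 2), and a single joint update combines the per-position value MSE with the clipped PPO objective (stage 3). All module shapes, sizes, and placeholder data are hypothetical and not taken from the chapter; the next-token shift and the KL penalty against the frozen reference model are omitted for brevity.

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32

def make_lm():
    # Toy stand-in for an LLM: embedding + output head.
    return torch.nn.Sequential(torch.nn.Embedding(vocab_size, hidden),
                               torch.nn.Linear(hidden, vocab_size))

def make_scalar_head():
    # Toy stand-in for a reward or value model that scores each position.
    return torch.nn.Sequential(torch.nn.Embedding(vocab_size, hidden),
                               torch.nn.Linear(hidden, 1))

# Stage 1: initialization (random modules here; in practice pre-trained /
# instruction fine-tuned checkpoints). The reference model is frozen.
policy, reference = make_lm(), make_lm()
for p in reference.parameters():
    p.requires_grad_(False)
reward_model, value_model = make_scalar_head(), make_scalar_head()

# Stage 2: reward model training on a human preference pair
# (the chosen response is preferred over the rejected one).
def reward_model_loss(chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids).mean()
    r_rejected = reward_model(rejected_ids).mean()
    return -F.logsigmoid(r_chosen - r_rejected)

# Stage 3: one joint update. Per position, the value model minimizes the MSE
# to the return and the policy minimizes the clipped PPO loss.
def ppo_step(tokens, returns, advantages, old_logprobs, eps=0.2):
    logits = policy(tokens)                                           # [T, vocab]
    logprobs = F.log_softmax(logits, dim=-1)
    logprobs = logprobs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # [T]
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    values = value_model(tokens).squeeze(-1)                          # [T]
    value_loss = F.mse_loss(values, returns)
    return policy_loss + value_loss

# Placeholder data just to show the call pattern.
chosen = torch.randint(0, vocab_size, (8,))
rejected = torch.randint(0, vocab_size, (8,))
rm_loss = reward_model_loss(chosen, rejected)

tokens = torch.randint(0, vocab_size, (8,))
loss = ppo_step(tokens, returns=torch.zeros(8),
                advantages=torch.randn(8), old_logprobs=torch.randn(8))
loss.backward()
```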
