Activity (Process)

RLHF Training Process with PPO

Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) proceeds in several stages. First, human preference data is collected and used to train a reward model. Once the reward model is trained, the main training phase begins: the target policy and the value function are optimized against a frozen reference model. At each update step, the policy's parameters are updated by minimizing the PPO loss, which combines rewards from the reward model, a penalty for drifting away from the reference model, and advantages derived from the current value function. In parallel, the value function is refined by minimizing a Mean Squared Error (MSE) loss between its predictions and the observed returns.
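
To make the update step concrete, the sketch below shows one way the two losses could be computed, assuming per-token log-probabilities from the current policy, the sampling policy, and the frozen reference model, along with value predictions, returns, and advantage estimates, are already available. The function and parameter names (ppo_rlhf_losses, clip_eps, kl_coef) are illustrative assumptions rather than part of the course material, and some implementations fold the reference-model KL penalty into the reward instead of the loss.

```python
# Minimal sketch of one PPO update step for RLHF (illustrative, not the
# course's reference implementation). All names and hyperparameters are
# assumptions for the example.
import torch
import torch.nn.functional as F


def ppo_rlhf_losses(
    policy_logprobs,   # (T,) log pi_theta(a_t | s_t) under the current policy
    old_logprobs,      # (T,) log-probs from the policy snapshot that sampled the response
    ref_logprobs,      # (T,) log-probs from the frozen reference model
    values,            # (T,) value-function predictions V_phi(s_t)
    returns,           # (T,) return targets for the value function
    advantages,        # (T,) advantage estimates (e.g. from GAE)
    clip_eps=0.2,      # PPO clipping range (assumed)
    kl_coef=0.1,       # weight of the KL penalty toward the reference model (assumed)
):
    # Probability ratio between the current policy and the sampling policy.
    ratio = torch.exp(policy_logprobs - old_logprobs)

    # Clipped surrogate objective, negated because we minimize.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Penalty for drifting away from the reference model.
    kl_penalty = (policy_logprobs - ref_logprobs).mean()

    # MSE loss for the value function, as described above.
    value_loss = F.mse_loss(values, returns)

    return policy_loss + kl_coef * kl_penalty, value_loss


if __name__ == "__main__":
    # Toy usage with random tensors standing in for a sampled response.
    T = 16
    dummy = lambda: torch.randn(T)
    pg_loss, vf_loss = ppo_rlhf_losses(
        policy_logprobs=dummy(), old_logprobs=dummy(), ref_logprobs=dummy(),
        values=dummy(), returns=dummy(), advantages=dummy(),
    )
    print(pg_loss.item(), vf_loss.item())
```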

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences