RLHF Policy Optimization as Loss Minimization
The reinforcement learning phase of RLHF optimizes the language model's policy by minimizing a loss function, formally written as min L(x, {y1, y2}, r). The loss L is computed from the input prompt x, a set of outputs {y1, y2} sampled from the policy, and a reward model r. The reward model, pre-trained on human preference data, supplies the feedback signal inside the loss, steering the policy toward responses that align with human preferences.
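To make the objective concrete, here is a minimal sketch in Python (PyTorch) of how such a loss could be computed. This is an illustrative assumption, not the specific algorithm from the source: it uses a simple REINFORCE-style reward-weighted log-likelihood estimator, and the names rlhf_policy_loss, log_probs, and rewards are hypothetical. In practice, PPO-style objectives (see the related PPO card) are more common.

```python
# Hypothetical sketch of an RLHF-style policy loss (REINFORCE-style estimator).
# The frozen reward model r scores each sampled output y_i for prompt x; the
# policy loss is the negative reward-weighted log-likelihood of those samples.

import torch

def rlhf_policy_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """log_probs: per-sample log pi(y_i | x) under the current policy.
    rewards:   per-sample scores r(x, y_i) from the frozen reward model.
    Returns L(x, {y1, y2, ...}, r) to be minimized."""
    # Centering the rewards (a simple baseline) reduces gradient variance.
    advantages = rewards - rewards.mean()
    # Minimizing -E[advantage * log pi] pushes probability mass toward
    # outputs the reward model scores above average.
    return -(advantages.detach() * log_probs).mean()

# Toy usage: two sampled outputs {y1, y2} for one prompt x.
log_probs = torch.tensor([-12.3, -15.1], requires_grad=True)  # from the policy
rewards = torch.tensor([0.8, -0.2])                           # from reward model r
loss = rlhf_policy_loss(log_probs, rewards)
loss.backward()  # gradients flow into the policy only; r stays fixed
print(loss.item())
```

Note that the rewards are detached: in this setup the reward model only provides the feedback signal inside the loss and is not itself updated during policy optimization.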

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Classification of LLM Adaptation Methods
RLHF Policy Optimization as Loss Minimization
A development team is fine-tuning a large language model for a specific task using a dataset of inputs and corresponding correct outputs. During a training iteration, the model produces an output that is very different from the correct target output. What is the immediate, primary function of this discrepancy within the training process?
Direct Supervision via Knowledge Distillation Loss in Weak-to-Strong Generalization
A large language model is undergoing a single step of fine-tuning on a new dataset. Arrange the following events in the correct chronological order to represent this process.
Data Selection and Filtering using Small Models
Diagnosing a Stagnant Fine-Tuning Process
Objective Function for Policy Learning in RLHF
Use of Proximal Policy Optimization (PPO) in RLHF
Application of A2C in RLHF for LLM Alignment
Role and Definition of the Reference Model in RLHF
Joint Optimization of Policy and Value Functions in RLHF
RLHF Policy Optimization Objective
Reference Policy in RLHF
RLHF Policy Optimization as Loss Minimization
A language model is being fine-tuned using an iterative feedback process. In each step, the model generates a response to a prompt. A separate, pre-trained scoring model then assigns a numerical score to this response based on its quality. What is the most direct and immediate use of this numerical score within a single step of this training loop?
Arrange the following events into the correct chronological order as they would occur within a single iterative step of the policy learning phase for a language model.
Diagnosing a Training Failure in an Iterative Fine-Tuning Process
Direct Preference Optimization (DPO)
Learn After
An AI development team is in a phase of training where their goal is to make a language model's responses more aligned with human preferences. They use an optimization process that aims to minimize a loss function L, which takes an input prompt x, a set of model-generated responses {y1, y2, ...}, and a component r as inputs. How does this loss function L primarily guide the model's policy towards generating better responses?
During the policy optimization phase where the objective is to minimize the loss function L(x, {y1, y2}, r), the parameters of the reward model r are updated simultaneously with the language model's policy to better reflect human preferences for the given prompt x.
Diagnosing Undesirable Behavior in Policy Optimization