Formula

RLHF Policy Optimization as Loss Minimization

The objective of the reinforcement learning phase in RLHF is to minimize a loss function, written min L(x, {y1, y2}, r), that optimizes the language model's policy. The loss L is computed from the input prompt x, a set of outputs sampled from the policy, here {y1, y2}, and a reward model r. The reward model, trained beforehand on human preference data, supplies the feedback signal inside the loss, guiding the policy toward generating responses that align with human preferences.
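
The note does not spell out the exact form of L. As a minimal, illustrative sketch only (the function name rlhf_policy_loss and the toy numbers below are assumptions, not from the source), the PyTorch snippet shows one common instantiation: a REINFORCE-style surrogate in which the frozen reward model r scores each sampled output and minimizing the loss raises the policy's log-probability of higher-reward outputs.

```python
import torch

def rlhf_policy_loss(policy_logprobs: torch.Tensor,
                     rewards: torch.Tensor) -> torch.Tensor:
    # REINFORCE-style surrogate for the note's L(x, {y1, y2}, r):
    # minimizing -E[ r(x, y) * log pi_theta(y | x) ] raises the
    # probability of outputs the (frozen) reward model scores highly.
    return -(rewards.detach() * policy_logprobs).mean()

# Toy usage: two outputs {y1, y2} sampled for one prompt x.
logprobs = torch.tensor([-3.2, -5.1], requires_grad=True)  # log pi_theta(y_i | x)
rewards = torch.tensor([1.4, -0.3])                         # r(x, y_i) from the reward model
loss = rlhf_policy_loss(logprobs, rewards)
loss.backward()  # gradient pushes probability mass toward the higher-reward output
```

In practice this surrogate is usually combined with a KL penalty against the pre-RL policy and optimized with PPO, but the structure above is enough to see how x, the sampled outputs, and r enter the loss.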


Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences