Formula

RLHF Objective Function

In the final stage of Reinforcement Learning from Human Feedback (RLHF), the LLM is fine-tuned by minimizing a reinforcement learning loss function. This objective can be written as min L(x, {y_1, y_2}, r), where L is the loss function, x is the input prompt, {y_1, y_2} denotes a pair of outputs generated by the LLM, and r is the reward signal provided by the trained Reward Model. The optimization adjusts the LLM's parameters to increase the probability of generating outputs that receive a high reward from the Reward Model.
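For concreteness, below is a minimal PyTorch sketch of an objective of this kind: a reward-weighted negative log-likelihood of sampled outputs plus a KL penalty that keeps the fine-tuned policy close to the reference (pre-RLHF) model. This is an illustrative simplification rather than a full RLHF algorithm such as PPO; the function name rlhf_loss, the tensor shapes, and the kl_coef parameter are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def rlhf_loss(policy_logits, ref_logits, output_ids, rewards, kl_coef=0.1):
    """Reward-weighted negative log-likelihood with a KL penalty (illustrative sketch).

    policy_logits: [batch, seq_len, vocab]  logits of the model being fine-tuned
    ref_logits:    [batch, seq_len, vocab]  logits of the frozen reference model
    output_ids:    [batch, seq_len]         token ids of a sampled output y
    rewards:       [batch]                  scalar reward r(x, y) from the Reward Model
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    ref_log_probs = F.log_softmax(ref_logits, dim=-1)

    # Log-probability the policy assigns to each token of the sampled output.
    token_logp = log_probs.gather(-1, output_ids.unsqueeze(-1)).squeeze(-1)
    ref_token_logp = ref_log_probs.gather(-1, output_ids.unsqueeze(-1)).squeeze(-1)

    seq_logp = token_logp.sum(dim=-1)               # log p_theta(y | x)
    kl = (token_logp - ref_token_logp).sum(dim=-1)  # per-sequence KL estimate

    # Minimizing this loss raises the probability of high-reward outputs,
    # while the KL term discourages drifting too far from the reference model.
    return -(rewards * seq_logp).mean() + kl_coef * kl.mean()


# Toy usage with random tensors standing in for real model outputs.
batch, seq_len, vocab = 2, 8, 100
policy_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
ref_logits = torch.randn(batch, seq_len, vocab)
output_ids = torch.randint(0, vocab, (batch, seq_len))
rewards = torch.tensor([0.8, -0.2])

loss = rlhf_loss(policy_logits, ref_logits, output_ids, rewards)
loss.backward()  # gradients flow only into policy_logits
```

The KL term is the standard way to prevent the fine-tuned model from collapsing onto degenerate high-reward outputs; production RLHF pipelines typically replace this plain reward-weighted loss with a clipped policy-gradient objective.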


Updated 2026-05-01


Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences
