Formula

RLHF Objective Function

In the final stage of Reinforcement Learning from Human Feedback (RLHF), the LLM is fine-tuned by minimizing a reinforcement learning loss function. This objective can be written as min L(x, {y_1, y_2}, r), where L is the loss function, x is the input prompt, {y_1, y_2} denotes a pair of outputs generated by the LLM, and r is the reward signal provided by the trained Reward Model. The optimization adjusts the LLM's parameters to increase the probability of generating outputs that receive a high reward from the Reward Model.
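For concreteness, below is a minimal PyTorch sketch of an objective of this kind: a reward-weighted negative log-likelihood of sampled outputs plus a KL penalty that keeps the fine-tuned policy close to the reference (pre-RLHF) model. This is an illustrative simplification rather than a full RLHF algorithm such as PPO; the function name rlhf_loss, the tensor shapes, and the kl_coef parameter are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def rlhf_loss(policy_logits, ref_logits, output_ids, rewards, kl_coef=0.1):
    """Reward-weighted negative log-likelihood with a KL penalty (illustrative sketch).

    policy_logits: [batch, seq_len, vocab]  logits of the model being fine-tuned
    ref_logits:    [batch, seq_len, vocab]  logits of the frozen reference model
    output_ids:    [batch, seq_len]         token ids of a sampled output y
    rewards:       [batch]                  scalar reward r(x, y) from the Reward Model
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    ref_log_probs = F.log_softmax(ref_logits, dim=-1)

    # Log-probability the policy assigns to each token of the sampled output.
    token_logp = log_probs.gather(-1, output_ids.unsqueeze(-1)).squeeze(-1)
    ref_token_logp = ref_log_probs.gather(-1, output_ids.unsqueeze(-1)).squeeze(-1)

    seq_logp = token_logp.sum(dim=-1)               # log p_theta(y | x)
    kl = (token_logp - ref_token_logp).sum(dim=-1)  # per-sequence KL estimate

    # Minimizing this loss raises the probability of high-reward outputs,
    # while the KL term discourages drifting too far from the reference model.
    return -(rewards * seq_logp).mean() + kl_coef * kl.mean()


# Toy usage with random tensors standing in for real model outputs.
batch, seq_len, vocab = 2, 8, 100
policy_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
ref_logits = torch.randn(batch, seq_len, vocab)
output_ids = torch.randint(0, vocab, (batch, seq_len))
rewards = torch.tensor([0.8, -0.2])

loss = rlhf_loss(policy_logits, ref_logits, output_ids, rewards)
loss.backward()  # gradients flow only into policy_logits
```

The KL term is the standard way to prevent the fine-tuned model from collapsing onto degenerate high-reward outputs; production RLHF pipelines typically replace this plain reward-weighted loss with a clipped policy-gradient objective.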


Updated 2026-05-01


Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences
