Concept

Applying the Plackett-Luce Model to RLHF Reward Modeling

In the context of Reinforcement Learning from Human Feedback (RLHF), the Plackett-Luce model can be adapted to train the reward model on listwise preference data. The idea is to define the 'worth' of each generated response y in a ranked list Y as a function of the reward model's scalar output, typically the exponentiated reward score. The probability of the observed ranking then factorizes into a product of softmax terms: at each position, the chosen response competes against all responses not yet ranked. Optimizing the reward model against the likelihood of the entire observed ranking offers a more holistic alternative to aggregating pairwise losses.
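The listwise loss described above can be sketched as follows. This is a minimal illustration, assuming the worth of each response is the exponentiated reward score and that `rewards` holds the reward model's outputs in human-preferred order (best first); the function name is illustrative, not from a specific library:

```python
import math

def plackett_luce_loss(rewards):
    """Negative log-likelihood of an observed ranking under Plackett-Luce.

    `rewards` lists the reward model's scalar scores in ranked order,
    best response first. Worth of a response = exp(reward).
    """
    loss = 0.0
    for k in range(len(rewards) - 1):
        # -log P(response k is chosen next from the remaining responses):
        # a softmax over the suffix, stabilised by subtracting its max.
        suffix = rewards[k:]
        m = max(suffix)
        log_z = m + math.log(sum(math.exp(r - m) for r in suffix))
        loss += log_z - rewards[k]
    return loss
```

Note that for a list of two responses this reduces exactly to the standard Bradley-Terry pairwise loss, -log sigmoid(r_winner - r_loser), which is why the Plackett-Luce objective can be viewed as the listwise generalization of pairwise reward-model training.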

Updated 2025-10-06

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course