Learn Before
Applying the Plackett-Luce Model to RLHF Reward Modeling
In Reinforcement Learning from Human Feedback (RLHF), the Plackett-Luce model can be adapted to train the reward model on listwise preference data. The idea is to define a positive 'worth' for each generated response y in a ranked list Y as a function of the reward model's output, commonly by exponentiating its scalar score. The reward model can then be optimized against the probability of the entire observed ranking, a more holistic alternative to aggregating pairwise losses.
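A minimal sketch of this loss, assuming worth = exp(reward score) so that each selection stage becomes a softmax over the not-yet-chosen responses; the function name, tensor shapes, and PyTorch usage here are illustrative, not taken from the source:

```python
import torch

def plackett_luce_loss(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: 1-D tensor of reward-model scores r(x, y_1), ..., r(x, y_K)
    # for one prompt, ordered best-to-worst according to the human ranking.
    # With worth phi(y) = exp(r(x, y)), the log-probability of the full
    # ranking is sum_k [ r_k - logsumexp(r_k, ..., r_K) ].
    K = rewards.shape[-1]
    log_prob = sum(
        rewards[k] - torch.logsumexp(rewards[k:], dim=-1) for k in range(K)
    )
    return -log_prob  # negative log-likelihood of the observed ranking

# Toy usage: four responses to one prompt, ranked best-to-worst.
scores = torch.tensor([2.1, 1.3, 0.4, -0.8], requires_grad=True)
loss = plackett_luce_loss(scores)
loss.backward()  # gradients flow back into the reward model's parameters
```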
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Applying the Plackett-Luce Model to RLHF Reward Modeling
Log-Probability of a Ranked Sequence
An AI team is using a probabilistic model to rank three generated summaries (A, B, C). The model assigns a positive 'strength' score to each summary. The probability of a summary being chosen as best from a given set of options is its strength score divided by the sum of the strength scores of all summaries in that set. This selection process is repeated on the remaining summaries to form a full ranking. Given the scores below, which statement is correct? (A worked application of the rule follows the list.)
- Summary A Strength: 6.0
- Summary B Strength: 3.0
- Summary C Strength: 1.0
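Applying the stated rule to these strengths gives, as a worked check, the first-stage choice probabilities and the probability of the full ranking A > B > C:

```latex
P(\text{A first}) = \frac{6}{6+3+1} = 0.6, \qquad
P(\text{B first}) = \frac{3}{10} = 0.3, \qquad
P(\text{C first}) = \frac{1}{10} = 0.1
\\[4pt]
P(A \succ B \succ C) = \frac{6}{10} \cdot \frac{3}{4} \cdot \frac{1}{1} = 0.45
```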
An AI system uses a probabilistic model to rank three generated text snippets: Snippet A, Snippet B, and Snippet C. The model assigns a positive 'worth' score to each snippet (A=9, B=6, C=3). The probability of a specific ranking is the product of the probabilities of sequentially choosing the best snippet from the set of options still remaining. Arrange the following steps in the correct order to calculate the probability of the ranking A > B > C. (The sequence is worked out below.)
Calculating Ranking Probability
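For the snippet question above, the three sequential steps, in order, and their product would be:

```latex
P(\text{A first}) = \frac{9}{9+6+3} = \frac{1}{2}, \quad
P(\text{B next} \mid \text{A removed}) = \frac{6}{6+3} = \frac{2}{3}, \quad
P(\text{C last}) = \frac{3}{3} = 1
\\[4pt]
P(A \succ B \succ C) = \frac{1}{2} \cdot \frac{2}{3} \cdot 1 = \frac{1}{3}
```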
Learn After
Worth Function in Plackett-Luce for RLHF Reward Modeling
A team is training a reward model using human feedback. Instead of collecting simple pairwise comparisons (e.g., 'Response A is better than Response B'), they have collected full rankings of four responses for each prompt and decide to use a listwise ranking model to train the reward model on this data. What is the primary conceptual advantage of this listwise approach over simply breaking each ranked list into all of its constituent pairs and aggregating the individual pairwise losses? (A sketch of the contrast follows.)
Reward Model Training Strategy
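One way to make the contrast concrete is a sketch like the following, assuming a Bradley-Terry-style pairwise loss as the baseline; all names and shapes are illustrative. A four-way ranking yields six independent pairwise terms, whereas the listwise (Plackett-Luce) loss assigns a single probability to the joint ranking event:

```python
import itertools
import torch
import torch.nn.functional as F

def pairwise_bt_loss(rewards: torch.Tensor) -> torch.Tensor:
    # rewards are ordered best-to-worst, so for every pair (i, j) with
    # i < j, the i-th response is the human-preferred one.
    losses = [
        -F.logsigmoid(rewards[i] - rewards[j])
        for i, j in itertools.combinations(range(len(rewards)), 2)
    ]
    return torch.stack(losses).sum()  # 6 separate terms for a 4-way ranking

def listwise_pl_loss(rewards: torch.Tensor) -> torch.Tensor:
    # Negative log-probability of the single joint ranking event.
    return -sum(
        rewards[k] - torch.logsumexp(rewards[k:], dim=-1)
        for k in range(len(rewards))
    )

scores = torch.tensor([1.5, 0.9, 0.2, -0.6])
print(pairwise_bt_loss(scores), listwise_pl_loss(scores))
```

Because the listwise term conditions each choice on the responses already ranked, it scores the consistency of the whole list at once rather than treating the six comparisons as unrelated events.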
Reward Model's Role in Listwise Preference Learning