Training a Reward Model with Preference Data
Training a reward model uses the collected preference labels. For each labeled example, the model is fed the original input prompt together with the pair of generated outputs, and it produces a scalar score for each output. Its parameters are then adjusted so that the human-preferred output receives a higher score than the rejected one, teaching the model to predict which responses humans would prefer.
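The pairwise objective described above is commonly a Bradley-Terry style loss: minimize the negative log-probability that the preferred output outscores the rejected one. A minimal sketch in plain Python (the function name and example scores are hypothetical, for illustration only):

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_preferred - r_rejected).

    The loss shrinks as the reward model scores the human-preferred
    output higher than the rejected one, and grows when the ranking
    is reversed.
    """
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the model already ranks the pair correctly, the loss is small:
good = pairwise_preference_loss(2.0, 1.0)
# When the ranking is reversed, the loss is larger:
bad = pairwise_preference_loss(1.0, 2.0)
```

Gradient descent on this loss pushes the scalar score of the preferred output up and the rejected output down, which is exactly the parameter adjustment the paragraph describes.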
References
Reference of Foundations of Large Language Models Course
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Related
Example of a User Prompt in RLHF
Training a Reward Model with Preference Data
Techniques for Generating Diverse Outputs in RLHF
A team is developing a system to align a language model with human preferences. Their data collection process involves providing a prompt to an existing, fine-tuned model, which then generates a single response. A human labeler then assigns a quality score from 1 to 10 to this single response. This process is repeated for thousands of different prompts. What is the most significant flaw in this methodology for the purpose of creating a robust preference-based reward model?
Arrange the following steps in the correct chronological order to describe the data collection process for training a reward model.
Designing a Data Collection Pipeline for a Creative Writing Assistant
Policy Learning in RLHF
Dual Role of the RLHF Reward Model: Ranking-based Training for Scoring Application
Relation between Verifiers and RLHF Reward Models
General Loss Minimization Objective for Reward Model Training
Architecture and Function of the RLHF Reward Model
Reward Model Training as a Ranking Problem in RLHF
Underdetermined Model
Limitations of Outcome-Based Rewards for Entire Sequences
Training a Reward Model with Preference Data
Converting Listwise Rankings to Pairwise Preferences for Reward Model Training
Diagnosing Undesired Model Behavior
An AI team is training a reward model using a dataset where, for each prompt, human annotators have ranked several generated responses from best to worst. What is the fundamental task the reward model is being trained to perform based on this specific type of data?
An AI development team is training a model to act as a helpful assistant. They create a dataset where, for each user prompt, human evaluators are shown two different generated responses and asked to choose which one is better. The model is then trained on this dataset of pairwise preferences. After training, the team observes that the model consistently assigns higher scores to longer, more detailed responses, even when they are less helpful or contain irrelevant information. Which of the following is the most likely explanation for this emergent behavior?
Ranking LLM Outputs as an Alternative to Rating
Regularization in RLHF Reward Model Training
Complexity of Reward Model Training in RLHF
Learn After
Preference Data Sample for Reward Model Training
A development team aims to create a model that can judge the quality of different text outputs. They have a dataset where for each input prompt, two different generated outputs have been compared by a human, with one labeled as 'preferred' and the other as 'not preferred'. How should they configure the training process for their quality-judging model to effectively learn from this comparative data?
Evaluating a Reward Model Training Strategy
You are training a model to predict which of two AI-generated summaries of a news article a human would find more helpful. Arrange the following steps into the correct sequence for a single training iteration of this model.
Probability-Based Supervision Signals for Reward Models