Reusing Transformer Training for Reward Models
Because the reward model used in alignment is itself a Large Language Model (LLM), its optimization can directly reuse standard Transformer training procedures. The primary modification is replacing the cross-entropy loss used in standard LLM pre-training with a pairwise comparison (ranking) loss, so the model learns from human preference data.
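As a minimal sketch of this idea, the snippet below implements a Bradley-Terry style pairwise ranking loss in PyTorch: minimizing it pushes the scalar reward of the human-preferred response above that of the rejected one. The `reward_model`, `prompt_ids`, `chosen_ids`, and `rejected_ids` names in the usage comments are illustrative assumptions, not part of any specific library.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_scores: torch.Tensor,
                          rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Minimizing this loss increases the margin between the reward assigned
    to the preferred response and the reward assigned to the rejected one.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative usage (hypothetical reward_model: a Transformer with a scalar
# output head that scores a (prompt, response) pair):
# chosen_scores = reward_model(prompt_ids, chosen_ids)      # shape: (batch,)
# rejected_scores = reward_model(prompt_ids, rejected_ids)  # shape: (batch,)
# loss = pairwise_ranking_loss(chosen_scores, rejected_scores)
# loss.backward()
```

The rest of the training loop (optimizer, batching, checkpointing) can stay exactly as in standard Transformer training; only the loss and the scoring head differ.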
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Optimal Reward Model Parameter Estimation
Empirical Reward Model Loss Formula using Bradley-Terry Model
Pair-wise Ranking Loss Formula for RLHF Reward Model
Correcting a Reward Model's Preference Error
A reward model is being trained using a dataset where each entry consists of a prompt, a 'preferred' response, and a 'rejected' response, as judged by humans. The training process works by adjusting the model's parameters to minimize a ranking loss function. What is the primary effect of successfully minimizing this ranking loss?
A reward model is being trained on a dataset of human preferences, where each data point consists of a prompt, a preferred response, and a rejected response. The training process aims to minimize a ranking loss function. For a single data point, which of the following outcomes would generate the largest loss value, thereby prompting the most significant update to the model's parameters?