Function and Inputs of the RLHF Reward Model
Within the Reinforcement Learning from Human Feedback (RLHF) framework, the reward model is a neural network whose job is to map a pair of token sequences, an input (prompt) x and the corresponding output (response) y, to a single scalar value that represents the reward for that pair.
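As a rough illustration, the following PyTorch sketch builds such a mapping on top of a small Transformer encoder. The class name, toy dimensions, and the choice to read the score off the final position are assumptions made for the example, not details stated on this card.

```python
# Minimal sketch (assumed architecture): a reward model that reads the
# concatenated prompt x and response y and emits one scalar reward.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Scalar head: maps a hidden state to one number, the reward.
        self.reward_head = nn.Linear(d_model, 1)

    def forward(self, x_tokens, y_tokens):
        # The reward model scores the pair (x, y), so both sequences are
        # concatenated into a single input before encoding.
        tokens = torch.cat([x_tokens, y_tokens], dim=1)        # (batch, |x|+|y|)
        hidden = self.backbone(self.embed(tokens))              # (batch, len, d_model)
        # Read the reward off the last position, which has seen the whole pair.
        return self.reward_head(hidden[:, -1, :]).squeeze(-1)   # (batch,)

# Usage: score one (prompt, response) pair with toy token ids.
rm = RewardModel()
x = torch.randint(0, 1000, (1, 12))   # prompt tokens
y = torch.randint(0, 1000, (1, 20))   # response tokens
print(rm(x, y))                        # e.g. tensor([0.14], grad_fn=...)
```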

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reward Model Implementation using a Pre-trained LLM
Troubleshooting a Reward Model's Architecture
Both a standard generative language model and an RLHF reward model are often based on the same core architecture (e.g., a Transformer decoder). What is the key architectural modification that allows the reward model to produce a single scalar quality score for a given text, rather than generating a new sequence of text? (A code sketch of this head swap appears after this list.)
Adapting a Language Model for Reward Prediction
Sequence-Level Evaluation in Reward Models
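On the architecture question above: the usual modification is to drop the vocabulary-sized next-token prediction head and attach a small head that outputs a single scalar. A hedged sketch using Hugging Face transformers follows; the choice of gpt2 and the num_labels=1 head are illustrative assumptions, not details taken from this card.

```python
# Illustrative sketch: reuse a pre-trained decoder backbone, but give it a
# single-output head instead of the vocabulary-sized LM head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# num_labels=1 attaches a head with one output, so the model returns a scalar
# score for the whole sequence rather than next-token logits.
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
model.config.pad_token_id = tokenizer.eos_token_id  # gpt2 defines no pad token

text = "Prompt: explain RLHF.\nResponse: RLHF fine-tunes a model with human feedback."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits.squeeze()  # one scalar for the (x, y) pair
print(reward)
```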
Learn After
Notation for the RLHF Reward Model
A system designed to improve language model outputs uses a special component. This component takes a user's initial text (a prompt) and a model-generated response, then outputs a single numerical score. If this component processes two different responses for the exact same prompt, giving 'Response A' a score of 4.1 and 'Response B' a score of -0.5, what is the most accurate interpretation of these scores? (A worked reading of such scores appears after this list.)
Identifying Reward Model Inputs and Output
Troubleshooting a Flawed Reward Model
Semantic Completeness in RLHF Reward Models
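On the score-interpretation question above: reward values are comparative rather than absolute, so for the same prompt the 4.1 response is simply judged better than the -0.5 one; the raw numbers are not grades or probabilities on their own. Under the commonly used Bradley-Terry reading of reward gaps (an assumption here, not something stated on this card), the difference can be converted into a preference probability:

```python
import math

r_a, r_b = 4.1, -0.5  # rewards for Response A and Response B on the same prompt
# Bradley-Terry style reading: probability that A is preferred over B.
p_a_over_b = 1 / (1 + math.exp(-(r_a - r_b)))
print(f"P(A preferred over B) = {p_a_over_b:.3f}")  # ~0.990
```

A gap of 4.6 corresponds to near-certain preference for Response A; only differences between scores for the same prompt carry meaning.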