Semantic Completeness in RLHF Reward Models
In Reinforcement Learning from Human Feedback (RLHF), the reward model assumes that both the input prompt and the generated output are complete texts. It therefore scores the relationship between a full prompt and a full response, each carrying complete semantic content, rather than judging partial or unfinished fragments. This is why, in typical RLHF pipelines, the scalar reward is assigned only once the model has finished generating the entire response.
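As a concrete illustration, here is a minimal sketch of scoring a complete (prompt, response) pair with a reward model implemented as a sequence classifier with a single output head, a common way such models are packaged. The checkpoint name my-org/rlhf-reward-model is a hypothetical placeholder, not a real model; any reward model trained with a one-label classification head would slot in the same way.

```python
# Minimal sketch: scoring a complete (prompt, response) pair with a
# scalar reward model. The checkpoint name below is hypothetical.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "my-org/rlhf-reward-model"  # placeholder, not a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Return a scalar reward for one complete prompt-response pair.

    The model sees the full concatenated text; it is not designed to
    score a half-finished response.
    """
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, 1)
    return logits.squeeze().item()

# Usage: compare two complete responses to the same prompt.
prompt = "Explain photosynthesis to a ten-year-old."
r_a = reward(prompt, "Plants use sunlight to turn water and air into food.")
r_b = reward(prompt, "Plants make food by a process called photosynthesis, where...")
print(r_a, r_b)
```

The raw scores are only meaningful relative to one another: given two complete responses to the same prompt, the higher score marks the response the reward model prefers.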
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Notation for the RLHF Reward Model
A system designed to improve language model outputs uses a special component. This component takes a user's initial text (a prompt) and a model-generated response, then outputs a single numerical score. If this component processes two different responses for the exact same prompt, giving 'Response A' a score of 4.1 and 'Response B' a score of -0.5, what is the most accurate interpretation of these scores?
Identifying Reward Model Inputs and Output
Troubleshooting a Flawed Reward Model