Learn Before
Notation for the RLHF Reward Model
The function of the reward model in RLHF is expressed as $r = R(x, y)$, where $r$ is the scalar reward, $x$ is the input prompt, and $y$ is the generated output. The reward, $r$, measures how well the output $y$ aligns with desired behavior for the input $x$. For notational simplicity, this function is often denoted simply as $R(x, y)$.
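To make the notation concrete, the sketch below shows a reward model as a plain Python function mapping a prompt and an output to a scalar. The scoring rule is a made-up heuristic standing in for a trained model, so only the interface, not the body, reflects how $R(x, y)$ is actually computed in practice.

```python
def R(x: str, y: str) -> float:
    """Toy reward model R(x, y): maps a prompt x and a generated output y
    to a scalar reward r. The heuristic below is hypothetical and only
    illustrates the (prompt, output) -> scalar interface; a real reward
    model would be a trained neural network."""
    if not y.strip():
        return -1.0  # empty outputs earn a low reward
    # Reward up to 50 words of content, normalized to [0, 1].
    return min(len(y.split()), 50) / 50.0

# Rewards for two candidate outputs to the same prompt are directly
# comparable: a higher r means the output better matches desired behavior.
x = "Write a short poem about a rainy day."
print(R(x, "The sky weeps, and the world listens."))  # r = R(x, y)
print(R(x, ""))                                       # lower reward
```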

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Notation for the RLHF Reward Model
A system designed to improve language model outputs uses a special component. This component takes a user's initial text (a prompt) and a model-generated response, then outputs a single numerical score. If this component processes two different responses for the exact same prompt, giving 'Response A' a score of 4.1 and 'Response B' a score of -0.5, what is the most accurate interpretation of these scores?
Identifying Reward Model Inputs and Output
Troubleshooting a Flawed Reward Model
Semantic Completeness in RLHF Reward Models
Learn After
A language model is given the input prompt, 'Write a short poem about a rainy day.' It generates the response, 'The sky weeps, and the world listens.' A separate evaluation model then assesses this response for the given prompt and assigns it a quality score of 9.2. If this evaluation process is represented by the function $r = R(x, y)$, which option correctly assigns the elements of this scenario to the function's variables?
In the context of evaluating a language model's output, a function is commonly expressed as $r = R(x, y)$. Match each component of this notation to its correct description.
Reward Function as a Linear Transformation of the Last Hidden State
Aggregated Reward as the Sum of Segment-Based Rewards
Interpreting Reward Model Notation