Learn Before
Notation for the RLHF Reward Model
The function of the reward model in RLHF is expressed as $r = R(x, y)$, where $r$ is the scalar reward, $x$ is the input prompt, and $y$ is the generated output. The reward, $r$, measures how well the output $y$ aligns with desired behavior for the input $x$. For notational simplicity, this function is often denoted simply as $R(x, y)$.
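To make the notation concrete, the sketch below shows a reward model as a plain Python function mapping a prompt and an output to a scalar. The scoring rule is a made-up heuristic standing in for a trained model, so only the interface, not the body, reflects how $R(x, y)$ is actually computed in practice.

```python
def R(x: str, y: str) -> float:
    """Toy reward model R(x, y): maps a prompt x and a generated output y
    to a scalar reward r. The heuristic below is hypothetical and only
    illustrates the (prompt, output) -> scalar interface; a real reward
    model would be a trained neural network."""
    if not y.strip():
        return -1.0  # empty outputs earn a low reward
    # Reward up to 50 words of content, normalized to [0, 1].
    return min(len(y.split()), 50) / 50.0

# Rewards for two candidate outputs to the same prompt are directly
# comparable: a higher r means the output better matches desired behavior.
x = "Write a short poem about a rainy day."
print(R(x, "The sky weeps, and the world listens."))  # r = R(x, y)
print(R(x, ""))                                       # lower reward
```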

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Notation for the RLHF Reward Model
A system designed to improve language model outputs uses a special component. This component takes a user's initial text (a prompt) and a model-generated response, then outputs a single numerical score. If this component processes two different responses for the exact same prompt, giving 'Response A' a score of 4.1 and 'Response B' a score of -0.5, what is the most accurate interpretation of these scores?
Identifying Reward Model Inputs and Output
Troubleshooting a Flawed Reward Model
Semantic Completeness in RLHF Reward Models
Learn After
A language model is given the input prompt, 'Write a short poem about a rainy day.' It generates the response, 'The sky weeps, and the world listens.' A separate evaluation model then assesses this response for the given prompt and assigns it a quality score of 9.2. If this evaluation process is represented by the function $r = R(x, y)$, which option correctly assigns the elements of this scenario to the function's variables?
In the context of evaluating a language model's output, a function is commonly expressed as $r = R(x, y)$. Match each component of this notation to its correct description.
Reward Function as a Linear Transformation of the Last Hidden State
Aggregated Reward as the Sum of Segment-Based Rewards
Interpreting Reward Model Notation