1Cademy - Reward Function as a Linear Transformation of the Last Hidden State

Learn Before

Notation for the RLHF Reward Model

Formula

The formula $r(\mathbf{x}, y) = \mathbf{h}_{\text{last}} \mathbf{W}_r$ defines the reward $r$ for a prompt $\mathbf{x}$ and a generated output $y$ as a linear function of a single hidden state, $\mathbf{h}_{\text{last}}$. This vector is the final-layer representation at the last token position, obtained by running the reward model's underlying language model over the prompt followed by the output. Because it sits at the end of the sequence, it can draw on every token of both $\mathbf{x}$ and $y$. $\mathbf{W}_r$ is a learned weight matrix that maps this hidden state to a scalar reward: if $\mathbf{h}_{\text{last}}$ is a row vector of dimension $d$, then $\mathbf{W}_r$ has shape $d \times 1$ (equivalently, a $d$-dimensional vector), and $r$ is their dot product.
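The formula above can be illustrated with a minimal sketch in plain Python. The function name `linear_reward`, the toy hidden states, and the weight values are all hypothetical and chosen only for illustration; in practice the hidden states would come from a Transformer and $\mathbf{W}_r$ would be learned from preference data.

```python
from typing import Sequence


def linear_reward(hidden_states: Sequence[Sequence[float]],
                  w_r: Sequence[float]) -> float:
    """Compute r = h_last . W_r for one sequence.

    hidden_states: one d-dimensional vector per token (final layer),
                   covering the prompt followed by the output.
    w_r:           d-dimensional weight vector (a d x 1 matrix, flattened).
    """
    h_last = hidden_states[-1]  # hidden state at the last token position
    if len(h_last) != len(w_r):
        raise ValueError("h_last and w_r must have the same dimension")
    # Dot product: a linear map from R^d to a scalar reward.
    return sum(h * w for h, w in zip(h_last, w_r))


# Toy example (illustrative values only): 3 tokens, hidden size d = 3.
hidden_states = [
    [0.2, 0.1, -0.3],
    [0.0, 0.5, 0.5],
    [1.0, 2.0, 3.0],   # h_last
]
w_r = [0.5, -1.0, 2.0]

reward = linear_reward(hidden_states, w_r)
print(reward)  # 4.5
```

Only the last row of `hidden_states` affects the result; the earlier rows influence the reward only indirectly, through whatever the model has already encoded into the last position.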