Reward Score Formula for LLM-based Reward Models

When a reward model is implemented using a pre-trained Large Language Model (LLM), the scalar reward score $r(\mathbf{x}, \mathbf{y})$ is computed by applying a linear transformation to the representation at the final position. The formula is $r(\mathbf{x}, \mathbf{y}) = \mathbf{h}_{\mathrm{last}} \mathbf{W}_r$, where $\mathbf{h}_{\mathrm{last}}$ is the $d$-dimensional hidden state vector from the top-most Transformer layer corresponding to the last token of the concatenated prompt and response sequence, and $\mathbf{W}_r$ is a $d \times 1$ linear mapping matrix.
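This computation can be sketched in a few lines of NumPy. The function and variable names below are illustrative (not from any specific library), and the hidden states are random placeholders standing in for real Transformer outputs:

```python
import numpy as np

def reward_score(hidden_states: np.ndarray, W_r: np.ndarray) -> float:
    """Scalar reward from the last-position hidden state.

    hidden_states: (seq_len, d) top-layer states for the concatenated
    prompt + response sequence; W_r: (d, 1) linear mapping matrix.
    """
    h_last = hidden_states[-1]      # d-dimensional vector at the last token
    return float(h_last @ W_r)      # (d,) @ (d, 1) -> (1,) -> scalar r(x, y)

# Toy usage with random stand-in values (d = 8, seq_len = 5)
rng = np.random.default_rng(0)
d = 8
h = rng.standard_normal((5, d))     # pretend top-layer hidden states
W = rng.standard_normal((d, 1))     # the learned reward head
score = reward_score(h, W)
```

In practice `W_r` is the only new parameter added on top of the pre-trained LLM, and it is trained jointly with (or on top of) the backbone during reward-model training.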


Updated 2026-05-01


Tags

Ch.4 Alignment - Foundations of Large Language Models
