Reward Score Formula for LLM-based Reward Models

When a reward model is implemented using a pre-trained Large Language Model (LLM), the scalar reward score $r(\mathbf{x}, \mathbf{y})$ is computed by applying a linear transformation to the representation at the final position. The formula is $r(\mathbf{x}, \mathbf{y}) = \mathbf{h}_{\mathrm{last}} \mathbf{W}_r$, where $\mathbf{h}_{\mathrm{last}}$ is the $d$-dimensional hidden state vector from the top-most Transformer layer corresponding to the last token of the concatenated prompt and response sequence, and $\mathbf{W}_r$ is a $d \times 1$ linear mapping matrix.
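This computation can be sketched in a few lines of NumPy. The function and variable names below are illustrative (not from any specific library), and the hidden states are random placeholders standing in for real Transformer outputs:

```python
import numpy as np

def reward_score(hidden_states: np.ndarray, W_r: np.ndarray) -> float:
    """Scalar reward from the last-position hidden state.

    hidden_states: (seq_len, d) top-layer states for the concatenated
    prompt + response sequence; W_r: (d, 1) linear mapping matrix.
    """
    h_last = hidden_states[-1]      # d-dimensional vector at the last token
    return float(h_last @ W_r)      # (d,) @ (d, 1) -> (1,) -> scalar r(x, y)

# Toy usage with random stand-in values (d = 8, seq_len = 5)
rng = np.random.default_rng(0)
d = 8
h = rng.standard_normal((5, d))     # pretend top-layer hidden states
W = rng.standard_normal((d, 1))     # the learned reward head
score = reward_score(h, W)
```

In practice `W_r` is the only new parameter added on top of the pre-trained LLM, and it is trained jointly with (or on top of) the backbone during reward-model training.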


Updated 2026-05-01


Tags

Ch.4 Alignment - Foundations of Large Language Models
