Formula

Reward Function as a Linear Transformation of the Last Hidden State

The formula $r(\mathbf{x}, y) = \mathbf{h}_{\text{last}} \mathbf{W}_r$ defines a reward function in which the reward $r$ for a given prompt $\mathbf{x}$ and generated output $y$ is calculated as a linear function of the final hidden state, $\mathbf{h}_{\text{last}}$, of the language model that produced $y$. Here, $\mathbf{h}_{\text{last}}$ is the vector representation of the last token in the output sequence, and $\mathbf{W}_r$ is a learned weight matrix or vector that transforms this hidden state into a scalar reward value.
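As a rough illustration of the formula, the sketch below computes the reward as a dot product between the last position's hidden state and a weight vector. It assumes the hidden states are already available as one vector per token and that the weights form a single vector of the same length. The function name `reward_from_hidden_states` and all numeric values are hypothetical, chosen only for the example; in a real model the hidden states come from a Transformer forward pass and the weights are learned.

```python
from typing import Sequence


def reward_from_hidden_states(
    hidden_states: Sequence[Sequence[float]],
    w_r: Sequence[float],
) -> float:
    """Return r = h_last . w_r for one sequence.

    hidden_states: one hidden vector per token (hypothetical toy input).
    w_r: weight vector with the same length as each hidden vector (assumed).
    """
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one token")

    # h_last is the representation of the last token in the sequence.
    h_last = hidden_states[-1]
    if len(h_last) != len(w_r):
        raise ValueError("h_last and w_r must have the same dimension")

    # Linear map from the hidden vector to a scalar reward.
    return sum(h * w for h, w in zip(h_last, w_r))


# Toy example: three tokens, hidden size 3 (made-up values).
hidden_states = [
    [0.5, -1.0, 2.0],
    [1.0, 0.0, -1.0],
    [2.0, 1.0, 0.5],
]
w_r = [0.5, -1.0, 2.0]

reward = reward_from_hidden_states(hidden_states, w_r)
print(reward)
```

Only the final row of `hidden_states` affects the result, which mirrors the formula's use of $\mathbf{h}_{\text{last}}$ alone; the earlier rows are there to show that they are ignored by the reward head.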

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences