
Architecture and Function of the RLHF Reward Model

In Reinforcement Learning from Human Feedback (RLHF), the reward model evaluates the concatenated sequence of an input prompt $\mathbf{x}$ and an output $\mathbf{y}$. Using a pre-trained Large Language Model (specifically a Transformer decoder) as the backbone, the model extracts the hidden representation at the last position, denoted $\mathbf{h}_{\mathrm{last}}$, which summarizes the semantic content of the full sequence. This $d$-dimensional vector is then mapped to a scalar reward score via a linear transformation: $r(\mathbf{x}, \mathbf{y}) = \mathbf{h}_{\mathrm{last}} \mathbf{W}_r$, where $\mathbf{W}_r$ is a $d \times 1$ linear mapping matrix. The score $r$ measures how well the output aligns with the desired behavior.
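As a minimal sketch of this architecture (assuming a PyTorch / Hugging Face transformers setup; the `RewardModel` class, the choice of `gpt2` as the backbone, and the example prompt are illustrative, not from the source), the reward model is just a decoder backbone plus a $d \times 1$ linear head applied to the last token's hidden state:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """Decoder backbone + linear head mapping h_last to a scalar reward."""

    def __init__(self, base_model_name: str = "gpt2"):  # illustrative backbone choice
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_model_name)
        d = self.backbone.config.hidden_size
        # W_r: a d x 1 linear map (no bias), matching r(x, y) = h_last W_r
        self.reward_head = nn.Linear(d, 1, bias=False)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Hidden states for the concatenated prompt+output sequence: (batch, seq_len, d)
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pick h_last: the hidden state at the last non-padding position of each sequence
        last_idx = attention_mask.sum(dim=1) - 1
        h_last = hidden[torch.arange(hidden.size(0)), last_idx]   # (batch, d)
        return self.reward_head(h_last).squeeze(-1)               # one scalar reward per sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = RewardModel("gpt2")

# Score a concatenated (prompt x, output y) pair
batch = tokenizer(["What is RLHF? It is a fine-tuning method based on human preferences."],
                  return_tensors="pt", padding=True)
with torch.no_grad():
    r = model(batch["input_ids"], batch["attention_mask"])
print(r)  # scalar reward for the sequence
```

Note that the head has no bias term, so the score is exactly the inner product $\mathbf{h}_{\mathrm{last}} \mathbf{W}_r$ from the definition above; in practice the head is trained on preference comparisons while the backbone is initialized from the pre-trained LM.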

