
Architecture and Function of the RLHF Reward Model

In Reinforcement Learning from Human Feedback (RLHF), the reward model evaluates the concatenated sequence of an input prompt $\mathbf{x}$ and an output $\mathbf{y}$. Using a pre-trained Large Language Model (specifically a Transformer decoder) as the backbone, the model extracts the hidden representation at the last position, denoted $\mathbf{h}_{\mathrm{last}}$, which summarizes the semantic content of the full sequence. This $d$-dimensional vector is then mapped to a scalar reward score via a linear transformation: $r(\mathbf{x}, \mathbf{y}) = \mathbf{h}_{\mathrm{last}} \mathbf{W}_r$, where $\mathbf{W}_r$ is a $d \times 1$ linear mapping matrix. The score $r$ measures how well the output aligns with the desired behavior.
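As a minimal sketch of this architecture (assuming a PyTorch / Hugging Face transformers setup; the `RewardModel` class, the choice of `gpt2` as the backbone, and the example prompt are illustrative, not from the source), the reward model is just a decoder backbone plus a $d \times 1$ linear head applied to the last token's hidden state:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """Decoder backbone + linear head mapping h_last to a scalar reward."""

    def __init__(self, base_model_name: str = "gpt2"):  # illustrative backbone choice
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_model_name)
        d = self.backbone.config.hidden_size
        # W_r: a d x 1 linear map (no bias), matching r(x, y) = h_last W_r
        self.reward_head = nn.Linear(d, 1, bias=False)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Hidden states for the concatenated prompt+output sequence: (batch, seq_len, d)
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pick h_last: the hidden state at the last non-padding position of each sequence
        last_idx = attention_mask.sum(dim=1) - 1
        h_last = hidden[torch.arange(hidden.size(0)), last_idx]   # (batch, d)
        return self.reward_head(h_last).squeeze(-1)               # one scalar reward per sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = RewardModel("gpt2")

# Score a concatenated (prompt x, output y) pair
batch = tokenizer(["What is RLHF? It is a fine-tuning method based on human preferences."],
                  return_tensors="pt", padding=True)
with torch.no_grad():
    r = model(batch["input_ids"], batch["attention_mask"])
print(r)  # scalar reward for the sequence
```

Note that the head has no bias term, so the score is exactly the inner product $\mathbf{h}_{\mathrm{last}} \mathbf{W}_r$ from the definition above; in practice the head is trained on preference comparisons while the backbone is initialized from the pre-trained LM.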

