Example

Diagram of Reward Score Calculation using an LLM

The process of calculating a reward score using a Transformer-based LLM is illustrated by a data flow. First, input prompt tokens (x_0, ..., x_m) are concatenated with response tokens (y_1, ..., y_n), followed by a special end-of-sequence token such as ⟨EOS⟩. This combined sequence is fed into a Transformer decoder (the LLM), which outputs a hidden-state representation for each token position (h_{x_0}, ..., h_{last}). The final hidden state h_{last}, corresponding to the ⟨EOS⟩ token, is selected to represent the entire sequence. This vector is then transformed by a linear mapping layer with weights W_r to produce a single scalar value, which serves as the reward score.
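The final two steps of this flow can be sketched in a few lines. The sketch below assumes the decoder has already been run: `hidden_states` is a hypothetical stand-in for its per-token outputs, and `W_r` is a randomly initialized weight vector rather than a trained one; only the "select the last hidden state, map it to a scalar" logic is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16   # hidden size of the (hypothetical) decoder
seq_len = 10   # prompt tokens + response tokens + <EOS>

# Stand-in for the decoder output: one hidden vector per token position,
# i.e. h_{x_0}, ..., h_{last} in the description above.
hidden_states = rng.standard_normal((seq_len, d_model))

# Linear mapping W_r that projects a hidden vector to a single scalar.
W_r = rng.standard_normal(d_model)

# Select the final hidden state (the <EOS> position) to represent the
# whole sequence, then apply the linear map to get the reward score.
h_last = hidden_states[-1]
reward = float(W_r @ h_last)

print(reward)  # one scalar reward score for the entire sequence
```

In a trained reward model, `W_r` would be learned (typically via a pairwise preference loss) and `hidden_states` would come from the fine-tuned decoder; the selection-and-projection step itself is unchanged.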

Updated 2025-10-10


Ch.4 Alignment - Foundations of Large Language Models
