1Cademy - Diagram of Reward Score Calculation using an LLM

Diagram of Reward Score Calculation using an LLM

A Transformer-based LLM can compute a reward score through the following data flow. First, the input prompt tokens ($x_0, ..., x_m$) are concatenated with the response tokens ($y_1, ..., y_n$), and a special end-of-sequence token such as ⟨EOS⟩ is appended. This combined sequence is fed into a Transformer decoder (the LLM), which outputs one hidden state per token position ($h_{x_0}, ..., h_{last}$). The final hidden state, $h_{last}$, corresponds to the ⟨EOS⟩ token. Because a decoder attends causally, this is the only position that can attend to every prompt and response token, so it is selected to represent the entire sequence. A linear mapping layer with weights $W_r$ then transforms this vector into a single scalar, $r = W_r h_{last}$, which serves as the reward score.
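The data flow above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a real reward model: `toy_decoder` is a hypothetical stand-in that returns pseudo-random vectors instead of running a Transformer, and the names `EOS`, `reward_score`, and `w_r` are invented for this sketch. The values in `w_r` are arbitrary. Only the three steps come from the description: concatenate, take the last hidden state, apply the linear map.

```python
import random

EOS = "<EOS>"  # assumed spelling of the end-of-sequence token


def toy_decoder(tokens, hidden_size=4):
    """Hypothetical placeholder for the Transformer decoder (NOT a real LLM).

    Returns one hidden vector per token. The vector at position i is seeded
    by tokens[0..i], so it depends only on the prefix up to that position,
    loosely imitating causal attention.
    """
    states = []
    for i in range(len(tokens)):
        rng = random.Random("|".join(tokens[: i + 1]))
        states.append([rng.uniform(-1.0, 1.0) for _ in range(hidden_size)])
    return states


def reward_score(prompt_tokens, response_tokens, w_r, decoder=toy_decoder):
    """Compute r = W_r . h_last for a prompt/response pair."""
    # Step 1: concatenate prompt and response, then append <EOS>.
    sequence = list(prompt_tokens) + list(response_tokens) + [EOS]
    # Step 2: one hidden state per position (h_x0, ..., h_last).
    hidden_states = decoder(sequence, hidden_size=len(w_r))
    # Step 3: keep only the hidden state at the <EOS> position.
    h_last = hidden_states[-1]
    # Step 4: linear map from the hidden vector to a single scalar.
    return sum(w * h for w, h in zip(w_r, h_last))


# Arbitrary illustrative weights; in practice W_r would be learned.
w_r = [0.5, -0.25, 0.1, 0.3]
score = reward_score(["What", "is", "2+2", "?"], ["4"], w_r)
print(score)
```

In an actual implementation the decoder would be a pre-trained LLM and $W_r$ a trained layer of shape $1 \times d$, where $d$ is the hidden size. The sketch keeps only the shape of the computation.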

Updated 2025-10-10

References

Reference of Foundations of Large Language Models Course
