Reward Model Implementation using a Pre-trained LLM
A common method for building a reward model is to adapt a pre-trained Large Language Model (LLM). The input prompt and the response are concatenated into a single sequence, which is processed from left to right using forced decoding. Because a language model restricts each position to attending only to its left context, representations at earlier positions cannot capture the full sequence; only the final position can see every token. A special symbol (e.g., an end-of-sequence token) is therefore appended to the end of the sequence, and the output of the top-most Transformer layer at this final position is taken as the comprehensive representation of the entire sequence. This vector is then mapped to a single scalar reward, typically with a linear layer that replaces the usual language-model output head.
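A minimal sketch of this procedure is shown below, assuming a Hugging Face transformers causal LM ("gpt2" is only a stand-in) and a hypothetical linear value head; it is an illustration of the idea, not the course's reference implementation.

```python
# Sketch: adapt a pre-trained causal LM into a reward model.
# Assumptions: Hugging Face transformers, GPT-2 as the backbone,
# and a hypothetical linear "value head" producing the scalar reward.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):
        super().__init__()
        # Pre-trained decoder-only LM used as the sequence encoder.
        self.backbone = AutoModel.from_pretrained(base_name)
        # The LM head is replaced by a linear layer mapping the final
        # hidden state to a single scalar reward.
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Forced decoding over the concatenated prompt + response:
        # the model only encodes the given tokens, it generates nothing new.
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Index of the last real (non-padding) token in each sequence; because
        # of the causal mask, only this position attends to the whole sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        # One scalar reward per sequence.
        return self.value_head(last_hidden).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = RewardModel("gpt2")

prompt = "How do I sort a list in Python?"
response = " Use the built-in sorted() function or list.sort()."
# Concatenate prompt and response, appending the end-of-sequence symbol
# whose final-layer state serves as the whole-sequence representation.
text = prompt + response + tokenizer.eos_token
batch = tokenizer(text, return_tensors="pt", padding=True)
with torch.no_grad():
    reward = model(batch["input_ids"], batch["attention_mask"])
print(reward.item())  # a single scalar quality score
```

Note that the value head is initialized randomly here; in RLHF it would subsequently be trained, for example with a pair-wise ranking loss over preferred and dispreferred responses.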

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Related
Troubleshooting a Reward Model's Architecture
Both a standard generative language model and an RLHF reward model are often based on the same core architecture (e.g., a Transformer decoder). What is the key architectural modification that allows the reward model to produce a single scalar quality score for a given text, rather than generating a new sequence of text?
Adapting a Language Model for Reward Prediction
Function and Inputs of the RLHF Reward Model
Sequence-Level Evaluation in Reward Models
Learn After
Pair-wise Ranking Loss Formula for RLHF Reward Model
Input Formulation for the RLHF Reward Model
Diagram of Reward Score Calculation using an LLM
An engineer is implementing a reward model by adapting a pre-trained language model. After feeding a concatenated prompt and response sequence into the model, they have access to the final layer's hidden state vector for each token in the sequence. To derive a single scalar reward score from these vectors, which of the following procedures should they implement?
You are tasked with implementing a reward model to score a response generated for a given prompt. Arrange the following steps in the correct chronological order to transform the prompt-response pair into a final scalar reward score.
Reward Model Implementation Analysis