Learn Before
Reward Score Formula for LLM-based Reward Models
When a reward model is implemented using a pre-trained Large Language Model (LLM), the scalar reward score is computed by applying a linear transformation to the representation at the final position. The formula is $r = h_{\text{last}} W_r$, where $h_{\text{last}}$ is a $d$-dimensional hidden state vector from the top-most Transformer layer corresponding to the last token of the concatenated prompt and response sequence, and $W_r$ is a $d \times 1$ linear mapping matrix.
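
The computation can be illustrated with a minimal sketch in plain Python. It assumes the top-layer hidden states are already available as a list of per-token vectors; the names `reward_score`, `hidden_states`, and `w_r` are illustrative, and the toy numbers stand in for the output of a real pre-trained LLM.

```python
import random


def reward_score(hidden_states, w_r):
    """Return r = h_last . W_r for one prompt+response sequence.

    hidden_states: list of per-token vectors (each of length d) taken from
                   the top-most layer; a stand-in for a real model's output.
    w_r:           list of d weights, playing the role of the d x 1 matrix W_r.
    """
    h_last = hidden_states[-1]  # representation at the final position
    if len(h_last) != len(w_r):
        raise ValueError("h_last and W_r must share the same dimension d")
    # A [1 x d] vector times a [d x 1] matrix is a dot product, i.e. a scalar.
    return sum(h * w for h, w in zip(h_last, w_r))


# Toy example: 5 tokens, hidden size d = 4 (assumed values, not real model output).
rng = random.Random(0)
hidden_states = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
w_r = [0.5, -0.25, 0.0, 1.0]

r = reward_score(hidden_states, w_r)
print(r)  # a single float: the reward for the whole sequence
```

Only the last position's vector enters the score in this sketch; information about earlier tokens reaches the reward only through whatever the model has encoded into that final hidden state.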

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reward Score Formula for LLM-based Reward Models
End-of-Sequence Reward Assignment in RLHF
In a system designed to evaluate the quality of generated text, a complex neural network first processes a prompt and its corresponding response, ultimately producing a high-dimensional vector that captures the nuanced meaning and relationship between them. What is the essential final step required to convert this complex vector into a practical, usable evaluation, and what is the nature of its output?
Troubleshooting a Reward Model's Output
From Representation to Reward
Learn After
A team training a reward model observes a peculiar behavior: the model consistently assigns higher scores to generated text that ends with the phrase '...and that is the final answer.', even when the main body of the text is of poor quality. The reward score is calculated by applying a linear transformation to the hidden state vector corresponding to the final token of the input sequence. Which of the following provides the most direct explanation for this behavior?
Critique of the Last-Token Reward Calculation Method
An engineer is implementing a reward model where the final scalar score r is computed from the last hidden state vector h_last using the formula r = h_last * W_r. If the hidden state vector h_last has dimensions of [1 x 4096], what must be the dimensions of the weight matrix W_r for the formula to produce a single scalar value?