Learn Before
A team training a reward model observes a peculiar behavior: the model consistently assigns higher scores to generated text that ends with the phrase '...and that is the final answer.', even when the main body of the text is of poor quality. The reward score is calculated by applying a linear transformation to the hidden state vector corresponding to the final token of the input sequence. Which of the following provides the most direct explanation for this behavior?
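The scheme the question describes can be sketched in a few lines. This is a toy illustration with random NumPy arrays standing in for transformer hidden states; all names and sizes are illustrative, not from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8  # toy size; a real model might use 4096

# Stand-in for per-token hidden states of a generated sequence:
# shape [seq_len x hidden_dim].
hidden_states = rng.normal(size=(5, hidden_dim))

# Reward head: a linear transformation applied ONLY to the final
# token's hidden state, as in the question.
W_r = rng.normal(size=(hidden_dim, 1))

reward = float(hidden_states[-1] @ W_r)

# Because only the last token's state is scored, a fixed closing phrase
# that steers that single state toward high-reward directions can inflate
# the score regardless of the quality of the preceding text.
print(reward)
```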
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Critique of the Last-Token Reward Calculation Method
An engineer is implementing a reward model where the final scalar score `r` is computed from the last hidden state vector `h_last` using the formula `r = h_last * W_r`. If the hidden state vector `h_last` has dimensions of [1 x 4096], what must be the dimensions of the weight matrix `W_r` for the formula to produce a single scalar value?