Learn Before
A team training a reward model observes a peculiar behavior: the model consistently assigns higher scores to generated text that ends with the phrase '...and that is the final answer.', even when the main body of the text is of poor quality. The reward score is calculated by applying a linear transformation to the hidden state vector corresponding to the final token of the input sequence. Which of the following provides the most direct explanation for this behavior?
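The scheme the question describes can be sketched in a few lines. This is a toy illustration with random NumPy arrays standing in for transformer hidden states; all names and sizes are illustrative, not from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8  # toy size; a real model might use 4096

# Stand-in for per-token hidden states of a generated sequence:
# shape [seq_len x hidden_dim].
hidden_states = rng.normal(size=(5, hidden_dim))

# Reward head: a linear transformation applied ONLY to the final
# token's hidden state, as in the question.
W_r = rng.normal(size=(hidden_dim, 1))

reward = float(hidden_states[-1] @ W_r)

# Because only the last token's state is scored, a fixed closing phrase
# that steers that single state toward high-reward directions can inflate
# the score regardless of the quality of the preceding text.
print(reward)
```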
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Critique of the Last-Token Reward Calculation Method
An engineer is implementing a reward model where the final scalar score `r` is computed from the last hidden state vector `h_last` using the formula `r = h_last * W_r`. If the hidden state vector `h_last` has dimensions of [1 x 4096], what must be the dimensions of the weight matrix `W_r` for the formula to produce a single scalar value?