Learn Before
Reward Transformation Formula
This formula defines a transformed reward function, r_2, based on an original reward function, r_1. The new reward is calculated by adding an arbitrary function, f, to the original reward. All three functions depend on the current state (s_t), the action (a_t), and the subsequent state (s_{t+1}). The mathematical expression is:

r_2(s_t, a_t, s_{t+1}) = r_1(s_t, a_t, s_{t+1}) + f(s_t, a_t, s_{t+1})

This demonstrates how alternative reward functions can be generated, which is a core aspect of why reward models can be underdetermined.
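A minimal Python sketch of this transformation (the reward function r1, the choice of f as a constant shift, and the toy trajectory below are all hypothetical, for illustration only):

```python
# Sketch of the reward transformation r_2 = r_1 + f.
# r1, f, and the trajectory are hypothetical stand-ins.

def r1(s, a, s_next):
    """Original reward: pays 1 for arriving at the state named 'goal'."""
    return 1.0 if s_next == "goal" else 0.0

def f(s, a, s_next):
    """Arbitrary added function; here a constant shift, one of many choices."""
    return 0.5

def r2(s, a, s_next):
    """Transformed reward: r_2(s_t, a_t, s_{t+1}) = r_1(...) + f(...)."""
    return r1(s, a, s_next) + f(s, a, s_next)

# A constant shift changes every trajectory's return by the same amount
# (for trajectories of equal length), so the relative ordering of such
# trajectories is unchanged -- one concrete way that distinct reward
# functions can be equally consistent with the same preference data.
trajectory = [("s0", "right", "s1"), ("s1", "right", "goal")]
print(sum(r1(*step) for step in trajectory))  # 1.0
print(sum(r2(*step) for step in trajectory))  # 2.0
```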
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Role of Regularization in Mitigating Reward Model Underdetermination
Reward Transformation Formula
A research team is training a model to score the quality of text responses. The training data consists of pairs of responses, where for each pair one is labeled as 'better' than the other. The model's objective is to assign a higher score to the 'better' response in every pair. The team successfully trains two models, Model A and Model B, and discovers that their internal parameters are significantly different. However, both models achieve 100% accuracy on the training data, correctly assigning the higher score in every pair. What fundamental principle of model training does this outcome best demonstrate? (A minimal sketch of how this can happen appears after this list.)
Analyzing Reward Model Discrepancies
Explaining Score Discrepancies in Trained Models
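One way the Model A / Model B outcome above can arise: any strictly increasing transformation of a scorer's outputs preserves every pairwise comparison, so many different parameterizations fit the same preference data perfectly. A minimal sketch, assuming hypothetical toy scorers (the length-based feature and the response pairs below are invented for illustration):

```python
# Two scoring "models" with different parameters that agree on every
# pairwise preference -- the training objective cannot tell them apart.

pairs = [("a thorough, correct answer", "a curt answer"),
         ("a detailed reply", "a reply")]  # (better, worse), hypothetical data

def model_a(text):
    # Toy scorer: a single hypothetical feature (response length).
    return float(len(text))

def model_b(text):
    # Different "parameters": any strictly increasing transform of model_a
    # (here 2*score + 3) ranks every pair identically.
    return 2.0 * model_a(text) + 3.0

for better, worse in pairs:
    assert model_a(better) > model_a(worse)
    assert model_b(better) > model_b(worse)
print("Both models reach 100% pairwise accuracy with different parameters.")
```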
Learn After
A research team is training an agent and finds that two different reward functions, r_1 and r_2, lead to the agent learning the exact same optimal behavior. The relationship between the two functions is defined as r_2(s_t, a_t, s_{t+1}) = r_1(s_t, a_t, s_{t+1}) + f(s_t, a_t, s_{t+1}) for some non-zero function f. What is the most accurate explanation for this phenomenon? (A worked sketch of one such case appears after this list.)
Analyzing Reward Function Equivalence
Analyzing Reward Function Invariance
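A minimal sketch of one well-known case behind the question above: potential-based reward shaping, f(s, a, s') = gamma*phi(s') - phi(s) for any potential function phi (Ng, Harada & Russell, 1999), which provably leaves the optimal policy unchanged. The three-state chain MDP and the potential phi below are hypothetical choices for illustration:

```python
# Potential-based shaping: r2 = r1 + gamma*phi(s') - phi(s) induces the
# same optimal policy as r1. The chain MDP below is a toy example.

GAMMA = 0.9
STATES, ACTIONS = range(3), ("left", "right")

def step(s, a):
    """Deterministic transitions on a chain: 0 <-> 1 <-> 2."""
    return min(s + 1, 2) if a == "right" else max(s - 1, 0)

def r1(s, a, s_next):
    return 1.0 if s_next == 2 else 0.0   # original reward: reach the right end

def phi(s):
    return float(s)                       # arbitrary potential over states

def r2(s, a, s_next):
    return r1(s, a, s_next) + GAMMA * phi(s_next) - phi(s)

def greedy_policy(reward, iters=100):
    """Value iteration, then a greedy one-step lookahead per state."""
    v = [0.0] * len(STATES)
    for _ in range(iters):
        v = [max(reward(s, a, step(s, a)) + GAMMA * v[step(s, a)]
                 for a in ACTIONS) for s in STATES]
    return [max(ACTIONS, key=lambda a: reward(s, a, step(s, a))
                + GAMMA * v[step(s, a)]) for s in STATES]

# Both reward functions induce the same optimal behavior.
print(greedy_policy(r1))                  # ['right', 'right', 'right']
print(greedy_policy(r2))                  # identical policy
assert greedy_policy(r1) == greedy_policy(r2)
```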