Explaining Score Discrepancies in Trained Models
An AI development team trains two separate models, Model X and Model Y, on the exact same dataset of human preferences. The training objective for both models is to assign a higher score to the preferred response in each pair. After training, both models achieve perfect accuracy on the training set. However, when the team inspects the models, they find that for a specific response, Model X assigns a score of 1.5, while Model Y assigns a score of 25.0. Explain, in the context of the training objective, how it is possible for both models to be considered perfectly trained despite producing such different absolute scores for the same input.
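A minimal sketch of the idea behind this question (the scores, responses, and quality values are hypothetical, not from the course): a pairwise preference objective only constrains the *ordering* of scores within each pair, so any order-preserving rescaling of a perfect scorer is also perfect, even though the absolute scores differ.

```python
# Illustrative sketch: pairwise training only checks orderings, not magnitudes.

def pairwise_accuracy(score, pairs):
    """Fraction of (preferred, rejected) pairs where the preferred item scores higher."""
    return sum(score(p) > score(r) for p, r in pairs) / len(pairs)

# Hypothetical responses identified by a latent "quality" level 0..3;
# every preference pair prefers the higher-quality response.
pairs = [(3, 1), (2, 0), (3, 0), (2, 1), (1, 0), (3, 2)]

model_x = lambda q: 0.5 * q        # top response scores 1.5
model_y = lambda q: (25.0 / 3) * q # top response scores 25.0

print(pairwise_accuracy(model_x, pairs))  # 1.0
print(pairwise_accuracy(model_y, pairs))  # 1.0
```

Both models rank every pair correctly, so both satisfy the training objective perfectly; the objective simply never pins down the absolute scale.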
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Role of Regularization in Mitigating Reward Model Underdetermination
Reward Transformation Formula
A research team is training a model to score the quality of text responses. The training data consists of pairs of responses, where for each pair, one is labeled as 'better' than the other. The model's objective is to assign a higher score to the 'better' response in every pair. The team successfully trains two models, Model A and Model B. They discover that the internal parameters of Model A and Model B are significantly different. However, both models achieve 100% accuracy on the training data, correctly assigning a higher score to the 'better' response in every single pair. What fundamental principle of model training does this outcome best demonstrate?
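One way to see the principle this question points at (a sketch with made-up scores, not the course's code): applying any strictly increasing transform to a perfect scorer yields a second "model" with very different score values, as if from different parameters, yet identical pairwise accuracy.

```python
import math

def accuracy(scores, pairs):
    """Pairwise accuracy of a score table over (better, worse) pairs."""
    return sum(scores[b] > scores[w] for b, w in pairs) / len(pairs)

# Hypothetical scores from a trained "Model A" over four responses.
scores_a = {"r0": -1.0, "r1": 0.2, "r2": 0.9, "r3": 4.0}
pairs = [("r3", "r1"), ("r2", "r0"), ("r1", "r0"), ("r3", "r2")]

# "Model B": a strictly increasing transform of A's scores — the values
# differ everywhere, but every ordering is preserved.
scores_b = {k: math.exp(v) + 7.0 for k, v in scores_a.items()}

print(accuracy(scores_a, pairs))  # 1.0
print(accuracy(scores_b, pairs))  # 1.0
```

Since infinitely many score functions induce the same ordering, the pairwise objective underdetermines the model: many different parameter settings are equally "perfect" solutions.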
Analyzing Reward Model Discrepancies