Diagnosing Undesirable Model Behavior
A text-generation model is being trained using a feedback signal derived from two independent scoring systems: one measuring 'informativeness' (how detailed and factual the text is) and another measuring 'safety' (how free the text is from biased or inappropriate content). The final feedback score used to update the model is a simple, unweighted average of the scores from these two systems.
After training, evaluators observe that the model consistently produces highly informative text, but it also frequently generates unsafe content.
Analyze this situation. Why would simple averaging of the scores lead to this specific undesirable outcome, even when one of the scoring systems correctly identifies unsafe content?
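To make the failure mode concrete, here is a minimal numeric sketch (the function name and all score values are invented for illustration). Because an unweighted average lets one scorer compensate for the other, a large informativeness advantage can more than offset a low safety score, so the training signal can still prefer the unsafe output:

    def combined_reward(informativeness: float, safety: float) -> float:
        # Unweighted average of the two scores, as in the scenario.
        return 0.5 * informativeness + 0.5 * safety

    # A highly detailed but unsafe response:
    r_unsafe = combined_reward(informativeness=0.98, safety=0.30)  # 0.640

    # A safe but much less detailed response:
    r_safe = combined_reward(informativeness=0.20, safety=0.95)    # 0.575

    print(r_unsafe > r_safe)  # True: averaging halves the safety penalty,
                              # so the optimizer is still rewarded for
                              # unsafe-but-informative text.

In this sketch the safety system correctly assigns a low score (0.30), but averaging dilutes that penalty by half; if informativeness is the easier signal for the model to increase, optimizing the averaged reward converges to detailed-but-unsafe text.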
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing Undesirable Model Behavior
An AI development team is training a policy model for a chatbot using a combined reward signal. This signal is a weighted average of scores from two reward models: one for 'Helpfulness' (scoring accuracy and completeness) and one for 'Harmlessness' (scoring safety and ethical considerations). The team observes that the resulting chatbot is overly cautious, frequently refusing to answer benign questions by stating it cannot help. Which of the following is the most effective and direct adjustment to the training process to correct this specific behavior?
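A hedged sketch of the adjustment this question points toward: re-weighting the combined reward so Helpfulness carries more weight relative to Harmlessness. The weights, score values, and the refusal/answer pair below are illustrative assumptions, not the course's actual reward models:

    def combined_reward(helpfulness: float, harmlessness: float, w_helpful: float) -> float:
        # Weighted average; raising w_helpful makes needless refusals costlier.
        return w_helpful * helpfulness + (1.0 - w_helpful) * harmlessness

    # Hypothetical scores for two responses to the same benign question:
    refusal = dict(helpfulness=0.05, harmlessness=0.99)  # "I can't help with that."
    answer = dict(helpfulness=0.90, harmlessness=0.60)   # a direct, accurate reply

    for w in (0.3, 0.5, 0.7):
        prefers_answer = combined_reward(**answer, w_helpful=w) > combined_reward(**refusal, w_helpful=w)
        print(f"w_helpful={w}: reward prefers the", "answer" if prefers_answer else "refusal")
    # w_helpful=0.3: reward prefers the refusal   <- over-cautious regime
    # w_helpful=0.5: reward prefers the answer
    # w_helpful=0.7: reward prefers the answer

At w_helpful=0.3 the refusal scores 0.708 versus 0.690 for the answer, so the policy is rewarded for refusing benign requests; shifting weight toward Helpfulness flips that preference without retraining either reward model.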