1Cademy - An AI development team is training a policy model for a chatbot using a combined reward signal. This signal is a weighted average of scores from two reward models: one for Helpfulness (scoring accuracy and completeness) and one for Harmlessness (scoring safety and ethical considerations). The team observes that the resulting chatbot is overly cautious, frequently refusing to answer benign questions by stating it cannot help. Which of the following is the most effective and direct adjustment

Learn Before

Using Combined Reward for Policy Supervision

Multiple Choice

An AI development team is training a policy model for a chatbot using a combined reward signal. This signal is a weighted average of scores from two reward models: one for 'Helpfulness' (scoring accuracy and completeness) and one for 'Harmlessness' (scoring safety and ethical considerations). The team observes that the resulting chatbot is overly cautious, frequently refusing to answer benign questions by stating it cannot help. Which of the following is the most effective and direct adjustment

Updated 2025-10-05

Contributors are:

Who are from:

Learn Before

Related