An engineer is training a language model for a customer service chatbot. They are deciding between two reward function designs to guide the model's learning process:
- Scheme A: {+1 for politeness, +2 for helpfulness, -100 for rudeness}
- Scheme B: {+5 for politeness, +10 for helpfulness, -15 for rudeness}
Which reward scheme is more likely to lead to a stable training process with lower gradient variance, and what is the most accurate reason?
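The intuition can be checked numerically. In REINFORCE-style policy gradients the update magnitude scales with the reward, so the variance of the reward signal is a rough proxy for gradient variance. The sketch below is a hypothetical toy comparison that assumes each outcome (polite, helpful, rude) occurs with equal probability during early training; the uniform distribution is an assumption, not part of the question.

```python
import statistics

# Hypothetical assumption: each outcome is equally likely early in training.
scheme_a = [1, 2, -100]  # politeness, helpfulness, rudeness
scheme_b = [5, 10, -15]

# Population variance of the reward signal, a proxy for gradient variance
# when updates scale linearly with reward.
var_a = statistics.pvariance(scheme_a)
var_b = statistics.pvariance(scheme_b)

print(f"Scheme A reward variance: {var_a:.1f}")  # ~2289.6
print(f"Scheme B reward variance: {var_b:.1f}")  # ~116.7
print(f"ratio: {var_a / var_b:.1f}x")            # ~19.6x
```

Under this toy distribution, Scheme A's outlier penalty of -100 inflates reward variance by roughly 20x relative to Scheme B, whose rewards sit on a comparable scale; that gap is the kind of signal the question is probing.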
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing Training Instability from Reward Design
Critiquing a Reward Function for Maze Navigation