Analyzing Training Instability from Reward Design
An engineer is training a language model to generate helpful and safe responses. The model receives a reward of +1 for each helpful sentence it produces. However, if any part of its response is flagged as unsafe, the entire response receives a reward of -100. The engineer observes that the training process is very unstable; the model struggles to improve consistently, and its performance fluctuates wildly between training updates. Based on this scenario, analyze the most probable cause of this training instability, specifically relating it to the design of the reward system.
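The instability the question points at is easiest to see numerically. Below is a minimal Python sketch, not part of the original scenario, that simulates per-response returns under this exact scheme; the response-length cap and the 2% unsafe-flag rate are assumed values chosen purely for illustration.

```python
# Minimal sketch (assumed parameters, not the engineer's actual setup):
# simulate returns under the stated reward scheme and measure their variance.
import numpy as np

rng = np.random.default_rng(0)

N_RESPONSES = 100_000   # simulated rollouts
MAX_SENTENCES = 10      # assumed cap on helpful sentences per response
P_UNSAFE = 0.02         # assumed chance a response is flagged unsafe

# +1 per helpful sentence, drawn uniformly for illustration.
helpful = rng.integers(1, MAX_SENTENCES + 1, size=N_RESPONSES)
unsafe = rng.random(N_RESPONSES) < P_UNSAFE  # rare catastrophic flag

# Per the scenario: if any part of the response is flagged unsafe,
# the entire response's reward becomes -100.
rewards = np.where(unsafe, -100.0, helpful.astype(float))

print(f"mean reward:              {rewards.mean():.2f}")
print(f"reward variance:          {rewards.var():.2f}")
print(f"variance without penalty: {helpful.astype(float).var():.2f}")
```

Under these assumed numbers the return variance grows by more than an order of magnitude once the -100 penalty is included, even though flags are rare: a single flagged sample can swing an entire batch's gradient estimate, which is the mechanism behind the fluctuating performance the engineer observes.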
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing Training Instability from Reward Design
An engineer is training a language model for a customer service chatbot. They are deciding between two reward function designs to guide the model's learning process:
- Scheme A: {+1 for politeness, +2 for helpfulness, -100 for rudeness}
- Scheme B: {+5 for politeness, +10 for helpfulness, -15 for rudeness}
Which reward scheme is more likely to lead to a stable training process with lower gradient variance, and what is the most accurate reason?
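A quick way to test the intuition is to score the same simulated behavior under both schemes. The sketch below does that; the event frequencies (how often a turn is polite, helpful, or rude) are assumed values for illustration only.

```python
# Minimal sketch: draw one shared sequence of behavior events and score it
# under each reward mapping. Event probabilities are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

EVENTS = ["polite", "helpful", "rude"]
P_EVENTS = [0.5, 0.45, 0.05]  # assumed behavior frequencies

scheme_a = {"polite": 1, "helpful": 2, "rude": -100}
scheme_b = {"polite": 5, "helpful": 10, "rude": -15}

draws = rng.choice(EVENTS, size=100_000, p=P_EVENTS)

for name, scheme in [("Scheme A", scheme_a), ("Scheme B", scheme_b)]:
    r = np.array([scheme[e] for e in draws], dtype=float)
    print(f"{name}: mean={r.mean():6.2f}  variance={r.var():8.2f}")
```

Because Scheme A's -100 outlier sits so far from its typical rewards, its variance dwarfs Scheme B's even though rude events are rare in this simulation, and it is that spread, not the signs or ordering of the rewards, that drives the variance of the gradient estimates.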
Critiquing a Reward Function for Maze Navigation