Critiquing a Reward Function for Maze Navigation
An AI agent is being trained to navigate a maze. The reward function is defined as: +0.1 for each step taken, -100 for hitting a wall, and +1 for reaching the exit. The agent consistently learns to avoid walls but struggles to find the exit, often wandering aimlessly. Based on the principles of gradient estimation, identify the primary issue with this reward structure that contributes to the agent's poor performance and propose a specific numerical change to address it.
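The flaw in this reward structure can be checked directly with the question's own numbers: because every step earns a positive +0.1, a long wandering trajectory can out-score a short trajectory that actually reaches the exit. The sketch below is a minimal, hypothetical illustration (the `trajectory_return` helper and the step counts are assumptions, not part of the question):

```python
# Illustrative sketch using the reward values stated in the question.
STEP_REWARD = 0.1      # +0.1 per step -- note this is a positive per-step reward
WALL_PENALTY = -100.0  # for hitting a wall
EXIT_REWARD = 1.0      # for reaching the exit

def trajectory_return(steps, wall_hits=0, reached_exit=False):
    """Undiscounted return of a trajectory under the stated reward function."""
    return (steps * STEP_REWARD
            + wall_hits * WALL_PENALTY
            + (EXIT_REWARD if reached_exit else 0.0))

direct = trajectory_return(steps=10, reached_exit=True)   # 10*0.1 + 1 = 2.0
wander = trajectory_return(steps=50, reached_exit=False)  # 50*0.1 = 5.0
print(direct, wander)  # aimless wandering out-scores solving the maze
```

Under these numbers, wandering for 50 steps yields a higher return (5.0) than exiting in 10 steps (2.0), which is consistent with the behavior described in the question.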
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing Training Instability from Reward Design
An engineer is training a language model for a customer service chatbot. They are deciding between two reward function designs to guide the model's learning process:
- Scheme A: {+1 for politeness, +2 for helpfulness, -100 for rudeness}
- Scheme B: {+5 for politeness, +10 for helpfulness, -15 for rudeness}
Which reward scheme is more likely to lead to a stable training process with lower gradient variance, and what is the most accurate reason?
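The variance comparison behind this question can be sketched numerically. In a REINFORCE-style policy gradient, the gradient estimate scales with the sampled return, so the variance of the return is a proxy for gradient variance. The simulation below is a hedged illustration: the outcome probabilities (e.g., a 5% chance of a rude response) and the `sample_return` helper are assumptions added for the example, not part of the question.

```python
import random

# Reward schemes from the question.
SCHEME_A = {"polite": 1, "helpful": 2, "rude": -100}
SCHEME_B = {"polite": 5, "helpful": 10, "rude": -15}

def sample_return(scheme, p_rude=0.05, rng=random):
    """One sampled episode return (assumed outcome model, for illustration)."""
    if rng.random() < p_rude:
        return scheme["rude"]
    return scheme["polite"] + scheme["helpful"]

def return_variance(scheme, n=100_000, seed=0):
    """Monte Carlo variance of returns -- a proxy for gradient variance."""
    rng = random.Random(seed)
    xs = [sample_return(scheme, rng=rng) for _ in range(n)]
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

print(return_variance(SCHEME_A))  # large: the rare -100 outlier dominates
print(return_variance(SCHEME_B))  # far smaller: rewards share a similar scale
```

Under this assumed outcome model, Scheme A's rare but extreme -100 penalty produces a return variance roughly an order of magnitude larger than Scheme B's, where all rewards sit on a comparable scale.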