Reward Signal Transformation in a Sequential Task
A robot arm is trained to stack three blocks in a specific order. It receives a reward of +10 only after placing the third and final block, and only if the entire stack is correct. For the intermediate actions (placing the first and second blocks), the reward is zero. Describe how this single, final reward can be transformed into a dense supervision signal for each of the three actions, and explain why this transformation helps the robot learn more effectively.
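One common way to make this transformation concrete is to compute a return (reward-to-go) for each step: the sum of all rewards from that step to the end of the episode, optionally discounted. The sketch below is a minimal illustration under that assumption, not a prescribed solution. The function name `rewards_to_go` and the discount factor `gamma` are hypothetical names introduced here; only the reward sequence (0, 0, +10) comes from the scenario.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return the discounted sum of future rewards for every time step.

    rewards: per-step rewards for one episode (here, one entry per block placed).
    gamma:   assumed discount factor; 1.0 means no discounting.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates everything that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Sparse reward from the scenario: nothing for blocks 1 and 2, +10 for block 3.
episode_rewards = [0.0, 0.0, 10.0]

undiscounted = rewards_to_go(episode_rewards)           # every action credited with 10
discounted = rewards_to_go(episode_rewards, gamma=0.5)  # earlier actions credited less

print(undiscounted)
print(discounted)
```

With no discounting, all three placements receive the same signal of 10, so the first two actions are no longer trained on a reward of zero. With an assumed `gamma` below 1, earlier placements receive a smaller share, which reflects that they are further from the outcome.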
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Improving Learning for a Maze-Solving Agent
An agent is learning to generate a five-sentence summary of a document. It only receives a final quality score (e.g., +0.9) after the entire summary is complete. To improve training, this single final score is used to create a learning signal for each of the five sentences generated. Which of the following options best analyzes how this transformation from a single score to multiple signals works?