Learn Before
Analyzing a Reward System's Weakness
A language model is being trained to solve multi-step math word problems. The training system checks only whether the model's final numerical answer is correct: a correct final answer earns a positive reward, and any other answer earns a negative reward. Describe a significant potential weakness of this training approach with respect to the model's actual problem-solving capabilities.
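To make the setup concrete, the reward rule described above could be sketched roughly as follows. This is only an illustration under assumptions: the names `ModelResponse` and `outcome_reward`, the reward values +1.0 and -1.0, and the numeric tolerance are all hypothetical, since the scenario specifies only "positive" and "negative" rewards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelResponse:
    # Hypothetical container for one model output.
    reasoning_steps: List[str]  # intermediate work; the reward below never reads it
    final_answer: float         # the only field the reward inspects


def outcome_reward(response: ModelResponse, correct_answer: float,
                   tol: float = 1e-9) -> float:
    """Return a positive reward if the final answer matches, else a negative one.

    The magnitudes (+1.0 / -1.0) and the tolerance are assumptions for this sketch.
    """
    if abs(response.final_answer - correct_answer) <= tol:
        return 1.0
    return -1.0


response = ModelResponse(
    reasoning_steps=["3 boxes * 4 apples = 12 apples", "12 + 5 = 17 apples"],
    final_answer=17.0,
)
print(outcome_reward(response, correct_answer=17.0))  # final answer matches
print(outcome_reward(response, correct_answer=18.0))  # final answer does not match
```

Note that `reasoning_steps` is carried along but plays no part in the computed reward, which mirrors the scenario's statement that only the final numerical answer is checked.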
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A team is training a language model to act as a programming assistant that writes code to solve specific problems. Their training method involves running the code generated by the model. If the code executes without errors and produces the correct output for a set of predefined tests, the model receives a high reward. If the code fails to execute or produces the wrong output, it receives a low reward. The system does not evaluate the elegance, efficiency, or style of the code itself, only the final result of its execution. Which of the following statements best characterizes this evaluation approach?
Analyzing a Reward System's Weakness
Evaluating a Reward System for an AI Tutor