Based on the following case study, identify the primary limitation of the reward mechanism being used and explain why this limitation is particularly problematic for a high-stakes task.

A practical application of an outcome-based reward model is in evaluating mathematical calculations. In this scenario, the model provides positive feedback for a correct final answer and negative feedback for an incorrect one, without assessing the intermediate steps of the calculation.
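As a minimal sketch of such a reward, assuming a +1/-1 signal and exact numeric comparison (the function name `outcome_reward` and the reward values are illustrative choices, not taken from any specific training system):

```python
def outcome_reward(final_answer: float, expected: float) -> int:
    """Return +1 if the final answer matches the reference exactly, else -1.

    Intermediate steps are never passed in, so they cannot affect the reward.
    The +1/-1 values are an assumption made for illustration.
    """
    return 1 if final_answer == expected else -1


print(outcome_reward(25, 25))  # correct final answer
print(outcome_reward(24, 25))  # incorrect final answer
```

The signature itself shows the limitation: the function only ever sees the final answer, so two responses with the same final answer receive the same reward regardless of how they got there.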

Example of an Outcome-Based Reward Model in Mathematics

A language model is being trained to solve math problems. The training process uses a reward system that provides feedback based *only* on whether the final numerical answer is correct or incorrect. The model is given the problem `(5 * 4) + (10 / 2)` and produces the following reasoning:
`Step 1: 5 * 4 = 20`
`Step 2: 10 / 2 = 4`
`Step 3: 20 + 4 = 24`
`Final Answer: 24`

How would this reward system evaluate the model's entire response?
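One way to work through this question is to run the response through a small sketch of the reward. The parsing regexes, the +1/-1 values, and the separate step check are all assumptions added for illustration; the step check is explicitly *not* part of the reward system described, and is shown only to highlight what the reward ignores.

```python
import operator
import re

response = """Step 1: 5 * 4 = 20
Step 2: 10 / 2 = 4
Step 3: 20 + 4 = 24
Final Answer: 24"""


def extract_final_answer(text: str) -> float:
    """Pull the number that follows 'Final Answer:' out of the response."""
    match = re.search(r"Final Answer:\s*(-?\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError("no final answer found")
    return float(match.group(1))


# Outcome-based reward: compares only the final answer to the reference.
expected = (5 * 4) + (10 / 2)
final = extract_final_answer(response)
reward = 1 if final == expected else -1

# Hypothetical step check (NOT used by the reward): recompute each step.
STEP = re.compile(r"Step \d+: (\d+) ([*/+-]) (\d+) = (\d+)")
OPS = {"*": operator.mul, "/": operator.truediv,
       "+": operator.add, "-": operator.sub}
wrong_steps = [
    i
    for i, m in enumerate(STEP.finditer(response), start=1)
    if OPS[m[2]](int(m[1]), int(m[3])) != int(m[4])
]

print("reward:", reward)
print("steps with arithmetic errors:", wrong_steps)
```

Under these assumptions the reward is a single scalar for the whole response. It carries no information about which step introduced the error, and the steps that were computed correctly earn no separate credit.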

Evaluating a Reward Mechanism for a Financial AI

A language model is being trained to solve math problems using a reward system that provides positive feedback only if the final numerical answer is exactly correct, and negative feedback otherwise. The model is given the problem "Calculate 2 to the power of 4 (2⁴)". It produces the following response:

"To solve this, I will multiply 2 by 4. The result of 2 times 4 is 8. Therefore, the final answer is 16."

Based on the described reward system, what feedback (positive or negative) would the model receive for this specific response, and why?
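The same kind of sketch can be applied to this response. The variable names and +1/-1 values are illustrative assumptions, and the consistency check is a hypothetical extra that the described reward system does not perform.

```python
# Reference answer for the problem "2 to the power of 4".
expected = 2 ** 4

# What the response claims: it multiplies 2 by 4, then states a final answer.
stated_intermediate = 2 * 4
final_answer = 16

# Outcome-based reward: looks only at the final answer.
reward = 1 if final_answer == expected else -1

# Hypothetical check (NOT part of the reward): does the stated method
# actually produce the stated final answer?
reasoning_supports_answer = stated_intermediate == final_answer

print("reward:", reward)
print("reasoning supports final answer:", reasoning_supports_answer)
```

In this sketch the response is rewarded even though its stated method does not lead to its final answer, which illustrates how an outcome-only signal can reinforce flawed reasoning that happens to land on the right number.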
