1Cademy - A language model is being trained to solve math problems. The training process uses a reward system that provides feedback based *only* on whether the final numerical answer is correct or incorrect. The model is given the problem `(5 * 4) + (10 / 2)` and produces the following reasoning: `Step 1: 5 * 4 = 20` `Step 2: 10 / 2 = 4` `Step 3: 20 + 4 = 24` `Final Answer: 24` How would this reward system evaluate the model's entire response?

Learn Before

Example of an Outcome-Based Reward Model in Mathematics

Multiple Choice

A language model is being trained to solve math problems. The training process uses a reward system that provides feedback based *only* on whether the final numerical answer is correct or incorrect. The model is given the problem `(5 * 4) + (10 / 2)` and produces the following reasoning:

`Step 1: 5 * 4 = 20`
`Step 2: 10 / 2 = 4`
`Step 3: 20 + 4 = 24`
`Final Answer: 24`

How would this reward system evaluate the model's entire response?
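
To make the setup concrete, here is a minimal sketch of an outcome-based reward in Python. It rests on assumptions that are not in the question itself: the reward is binary (1.0 or 0.0), the response ends with a `Final Answer:` line that a regular expression can parse, and `outcome_reward` is a hypothetical helper written for illustration, not a function from any library.

```python
import re


def outcome_reward(response: str, reference_answer: float) -> float:
    """Hypothetical outcome-based reward: looks only at the final answer.

    Returns 1.0 if the number on the 'Final Answer:' line equals the
    reference answer, otherwise 0.0. The intermediate steps are never read.
    """
    match = re.search(r"Final Answer:\s*(-?\d+(?:\.\d+)?)", response)
    if match is None:
        # No parseable final answer: treated as incorrect (an assumption).
        return 0.0
    return 1.0 if float(match.group(1)) == reference_answer else 0.0


# The model's response from the question, one step per line.
response = (
    "Step 1: 5 * 4 = 20\n"
    "Step 2: 10 / 2 = 4\n"
    "Step 3: 20 + 4 = 24\n"
    "Final Answer: 24"
)

# Reference value of the problem: 20 + 5.0 = 25.0
reference = (5 * 4) + (10 / 2)

reward = outcome_reward(response, reference)
print(reward)  # 0.0
```

Under these assumptions a single score is assigned to the whole response. The function compares only the final number with the reference value, so it cannot give credit for the correct Step 1 or point to Step 2 as the source of the error.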

Updated 2025-10-07
