Limitations of Outcome-Based Rewards for Entire Sequences
Reward models are often used to evaluate an entire sequence, providing feedback based solely on the final outcome. While this outcome-based approach is effective for tasks whose correctness is easy to verify, such as evaluating a mathematical expression, it proves insufficient for problems that demand complex reasoning. For such tasks, merely knowing whether the final answer is right or wrong does not help the model learn the intermediate steps or the logical process required to reach the correct solution, much like a student who only sees the final answer to a difficult problem cannot identify where their reasoning went wrong.
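To make the distinction concrete, here is a minimal Python sketch contrasting an outcome-based reward, which assigns one scalar to the whole generated solution based only on its final answer, with a process-based reward that scores each intermediate step. The function names, the line-based step splitting, and the simple string-matching checks are illustrative assumptions, not an implementation described in the text.

```python
# Minimal sketch (illustrative assumptions; not from the source text).

def outcome_reward(solution: str, final_answer: str) -> float:
    """Outcome-based reward: one scalar for the entire sequence,
    based only on whether the last line contains the reference answer."""
    last_line = solution.strip().splitlines()[-1]
    return 1.0 if final_answer in last_line else 0.0

def process_reward(solution: str, reference_steps: list[str]) -> list[float]:
    """Process-based reward: a score per intermediate step, so the model
    also receives feedback on where its reasoning went wrong."""
    steps = [line for line in solution.strip().splitlines() if line]
    return [
        1.0 if i < len(reference_steps) and reference_steps[i] in step else 0.0
        for i, step in enumerate(steps)
    ]

# A solution whose last step is wrong: 2 * 7 should be 14, not 15.
solution = "2 * (3 + 4)\n= 2 * 7\n= 15"
print(outcome_reward(solution, "14"))                        # 0.0: only says the answer is wrong
print(process_reward(solution, ["(3 + 4)", "2 * 7", "14"]))  # [1.0, 1.0, 0.0]: points to the failing step
```

In this sketch the outcome-based signal collapses everything into a single 0.0, while the step-level scores localize the error to the final step, which is exactly the information the outcome-only setup withholds from the model.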
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Related
Policy Learning in RLHF
Dual Role of the RLHF Reward Model: Ranking-based Training for Scoring Application
Relation between Verifiers and RLHF Reward Models
General Loss Minimization Objective for Reward Model Training
Architecture and Function of the RLHF Reward Model
Reward Model Training as a Ranking Problem in RLHF
Underdetermined Model
Limitations of Outcome-Based Rewards for Entire Sequences
Training a Reward Model with Preference Data
Converting Listwise Rankings to Pairwise Preferences for Reward Model Training
Diagnosing Undesired Model Behavior
An AI team is training a reward model using a dataset where, for each prompt, human annotators have ranked several generated responses from best to worst. What is the fundamental task the reward model is being trained to perform based on this specific type of data?
An AI development team is training a model to act as a helpful assistant. They create a dataset where, for each user prompt, human evaluators are shown two different generated responses and asked to choose which one is better. The model is then trained on this dataset of pairwise preferences. After training, the team observes that the model consistently assigns higher scores to longer, more detailed responses, even when they are less helpful or contain irrelevant information. Which of the following is the most likely explanation for this emergent behavior?
Ranking LLM Outputs as an Alternative to Rating
Regularization in RLHF Reward Model Training
Complexity of Reward Model Training in RLHF
A team is fine-tuning a language model to act as a programming assistant that writes code. For each programming problem, the model generates a block of code. The fine-tuning process involves running the generated code against a set of predefined tests. If the code passes all the tests, the model receives a high reward. If it fails any test, it receives a low reward. The structure, style, or efficiency of the code itself is not directly evaluated for the reward signal. Which principle of model fine-tuning does this scenario best exemplify?
Identifying Fine-Tuning Methodologies
Analyzing Fine-Tuning Methodologies
Learn After
Aspect-Based Sentiment Analysis as an Example of Granular Evaluation
Segment-Based Reward Computation
Importance of Step-by-Step Supervision for Complex LLM Reasoning Tasks
Debugging Common C Syntax Errors: A 'Hello, World!' Example
Example of Outcome-Based Reward for a Mathematical Task
A research team is fine-tuning a language model on two different tasks. For which of the following tasks would a reward system that only provides a single score based on the final output's correctness be the least effective for identifying and correcting errors in the model's generation process?
LLMs for Textual Error Correction
Diagnosing a Flawed LLM Training Strategy
Critique of a Training Method for a Story-Writing AI
Aspect-Based Sentiment Analysis (ABSA)
Process-Based Supervision for Complex Reasoning