
Limitations of Outcome-Based Rewards for Entire Sequences

Reward models are often applied to an entire sequence, providing feedback based solely on the final outcome. This outcome-based approach works well for tasks whose correctness is easy to verify, such as evaluating a mathematical expression, but it falls short on problems that demand complex, multi-step reasoning. For such tasks, knowing only whether the final answer is right or wrong does not teach the model the intermediate steps or logical process needed to reach a correct solution, much as a student who sees only the final answer to a difficult problem cannot locate the mistake in their own reasoning.
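To make the contrast concrete, here is a minimal Python sketch, not drawn from any particular library: the function names, the toy arithmetic trace, and the `bad_steps` verifier are all hypothetical stand-ins. It compares a single outcome-based scalar for a whole trace with per-step process rewards that can localize a faulty reasoning step.

```python
def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-based: a single scalar judged only on the final result."""
    return 1.0 if final_answer == reference else 0.0


def process_rewards(steps, is_step_valid):
    """Process-based: one scalar per step, localizing where reasoning breaks."""
    return [1.0 if is_step_valid(step) else 0.0 for step in steps]


# Toy trace for "3 * 4 + 5 = ?" -- the second step introduces the error.
trace = ["3 * 4 = 12", "12 + 5 = 18", "answer: 18"]
bad_steps = {"12 + 5 = 18", "answer: 18"}  # hypothetical stand-in for a learned step verifier

print(outcome_reward("18", "17"))                            # 0.0: result is wrong, but where?
print(process_rewards(trace, lambda s: s not in bad_steps))  # [1.0, 0.0, 0.0]: step 2 is the mistake
```

The per-step vector pinpoints the exact step where the chain of reasoning breaks, which is precisely the signal a single outcome-based scalar cannot provide.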
