Process-Based Reward Model as a Classification Task
A reward model can be trained on step-level annotated data to provide supervisory feedback for policy learning by treating the task as a classification problem. The model, typically a Transformer decoder with a final Softmax layer, takes the problem description x and the preceding reasoning steps s_1, ..., s_{t-1} as input at step t. It then outputs a probability distribution over a set of predefined labels, such as 0 (incorrect) or 1 (correct), to classify the current step.
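The classification head described above can be sketched in a few lines of pure Python. This is a minimal illustration, not the book's implementation: the hidden dimension, the weight values, and the hidden-state vector are all made-up assumptions; in practice the hidden state would come from the Transformer decoder after encoding the problem and the steps so far.

```python
import math

HIDDEN_DIM = 4
LABELS = ["incorrect", "correct"]  # the predefined label set (0 and 1)

# Hypothetical weights of the final linear projection: one row of
# HIDDEN_DIM weights (plus a bias) per label. Values are illustrative.
W = [[0.2, -0.5, 0.1, 0.3],   # produces the logit for "incorrect"
     [0.4, 0.6, -0.2, 0.1]]   # produces the logit for "correct"
b = [0.0, 0.1]

def softmax(logits):
    # Numerically stable Softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_step(hidden):
    """Map the decoder's hidden state for the current step to a
    probability distribution over the predefined labels."""
    logits = [sum(w_i * h_i for w_i, h_i in zip(row, hidden)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

# Stand-in hidden state for (problem, preceding steps, current step).
probs = classify_step([0.5, 1.0, -0.3, 0.8])
label = LABELS[probs.index(max(probs))]
```

The Softmax ensures the outputs form a valid probability distribution over the label set, which is what lets the model's judgment be read as "how likely is this step to be correct".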
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Richer Annotation Schemes for Reasoning Steps
Improving Annotation Efficiency with Active Learning
Prioritizing Annotation on Confidently Incorrect Reasoning Steps
Process Reward Model (PRM)
A development team is training a language model to generate step-by-step solutions to complex logic puzzles. The primary objective is to improve the model's ability to construct a valid and coherent reasoning path, not just to arrive at the correct final conclusion. The team plans to use human annotators to provide feedback on the model's generated solutions. Which of the following annotation strategies is most directly aligned with improving the model's reasoning process?
Improving an AI Math Tutor's Reasoning
Evaluating Annotation Strategies for AI Training
Learn After
Scoring Reasoning Paths by Counting Correct Steps
Log-Probability-Based Reward for Reasoning Paths
Practical Benefits of Detailed Supervision for Long Reasoning Paths
An AI team is building a supervisory model to assess each step in a multi-step reasoning process. The model receives the initial problem and all preceding steps as input, and it must output a judgment on whether the current step is 'correct' or 'incorrect'. Given this objective, which architectural component is most appropriate for the model's final layer, and why?
Designing a Reward Model for a Cooking Assistant
When a process-based reward model is framed as a classification task, its primary function is not to output a single, continuous score (e.g., from 0.0 to 1.0); rather, it outputs a probability distribution over a set of discrete labels (e.g., 'correct' vs. 'incorrect') for each reasoning step.
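The contrast between the classifier's direct output and a scalar score can be made concrete with a short sketch (the logit values are illustrative assumptions, not from the text): the model emits a distribution over the discrete labels, and a scalar step-level reward, if one is needed downstream, is typically read off as the probability assigned to the 'correct' label.

```python
import math

def softmax(logits):
    # Numerically stable Softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits over the two labels:
# index 0 = "incorrect", index 1 = "correct".
probs = softmax([0.4, 1.9])

# The classifier's direct output is the distribution `probs`; the
# probability of the "correct" label can be reused as a scalar reward.
step_reward = probs[1]
```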