
Process-Based Reward Model as a Classification Task

A reward model can be trained on step-level annotated data to provide supervisory feedback for policy learning by treating the task as a classification problem. The model, typically a Transformer decoder with a final Softmax layer, takes the problem description $\mathbf{x}$ and the preceding reasoning steps $\bar{\mathbf{y}}_{\le k}$ as input at step $k$. It then outputs a probability distribution over a set of predefined labels, such as $\{\text{correct}, \text{incorrect}\}$ or $\{\text{correct}, \text{incorrect}, \text{neutral}\}$, to classify the current step.
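The classification view above can be sketched in a few lines. This is a minimal illustration, not the book's implementation: the logits stand in for the output of a Transformer decoder (with a linear classification head) run on $(\mathbf{x}, \bar{\mathbf{y}}_{\le k})$; the `classify_step` helper and the toy logit values are assumptions for the example.

```python
import numpy as np

# Predefined step labels from the three-way labeling scheme
LABELS = ["correct", "incorrect", "neutral"]

def softmax(logits):
    # Numerically stable softmax over the label dimension
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def classify_step(logits):
    """Map raw logits for one reasoning step (as a Transformer
    decoder head would produce them) to a probability distribution
    over the predefined step labels."""
    probs = softmax(np.asarray(logits, dtype=float))
    return dict(zip(LABELS, probs))

# Toy logits standing in for the model's output at step k;
# a real PRM would compute these from the problem description
# and the preceding reasoning steps.
probs = classify_step([2.0, -1.0, 0.5])
best_label = max(probs, key=probs.get)
```

The probability assigned to the `correct` label is what is typically used as the step-level reward signal during policy learning.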

Updated 2026-05-03
