1Cademy - Log-Probability-Based Reward for Reasoning Paths

Method A: Assigns a score of +1 to each step where the probability of being 'correct' is greater than 0.5. The total score is the sum of these step scores.
Method B: Calculates the total score by summing the logarithm of the 'correct' probability for every step in the process.
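
The two methods can be written compactly as follows. The notation is an assumption introduced here for illustration, not taken from the source: a reasoning path has n steps, and p_i denotes the classifier's probability that step i is 'correct'.

```latex
% Method A: count the steps whose 'correct' probability exceeds 0.5
S_A = \sum_{i=1}^{n} \mathbb{1}\left[\, p_i > 0.5 \,\right]

% Method B: sum the log-probabilities of 'correct' over all steps
S_B = \sum_{i=1}^{n} \log p_i
```

Under these definitions S_A is a non-negative integer, while S_B is at most 0 and moves closer to 0 as the classifier becomes more confident in every step.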

Learn Before

Process-Based Reward Model as a Classification Task

Log-Probability-Based Reward for Reasoning Paths

An alternative way to evaluate a reasoning path uses the log-probabilities produced by the step-level classification model. Instead of counting the steps classified as 'correct', this approach sums the log-probability of the 'correct' label over all steps to form the reward for the entire path. Because each step contributes its confidence rather than a binary vote, the score can distinguish paths that a simple count would rate equally.
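
A minimal sketch of the difference between counting 'correct' steps and summing log-probabilities, assuming the step-level classifier has already produced one 'correct' probability per step. The function names and the example probabilities are hypothetical and chosen only for illustration.

```python
import math


def threshold_count_reward(step_probs, threshold=0.5):
    """Count-based score: +1 for each step whose 'correct' probability exceeds the threshold."""
    return sum(1 for p in step_probs if p > threshold)


def log_prob_reward(step_probs):
    """Log-probability score: sum of log p('correct') over all steps.

    Assumes every probability is strictly greater than 0;
    math.log raises ValueError for p == 0.
    """
    return sum(math.log(p) for p in step_probs)


# Hypothetical per-step 'correct' probabilities for two three-step paths.
path_confident = [0.95, 0.90, 0.99]
path_borderline = [0.55, 0.51, 0.60]

count_confident = threshold_count_reward(path_confident)
count_borderline = threshold_count_reward(path_borderline)
logp_confident = log_prob_reward(path_confident)
logp_borderline = log_prob_reward(path_borderline)

print("count-based:", count_confident, count_borderline)
print("log-prob   :", logp_confident, logp_borderline)
```

Both paths receive the same count-based score because every step clears the 0.5 threshold, while the log-probability score is higher (closer to 0) for the path whose steps the classifier is more confident about.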