Formula for Log-Probability-Based Reward for Reasoning Paths

The reward for a reasoning path is calculated by summing the log-probabilities of each step being classified as 'correct' by the reward model. This method provides a more granular score than simply counting correct steps. The formula is:

$$r(\mathbf{x}, \mathbf{y}) = \sum_{k=1}^{n_s} \log \Pr(\text{correct} \mid \mathbf{x}, \bar{\mathbf{y}}_{\le k})$$

where:

  • $r(\mathbf{x}, \mathbf{y})$ is the total reward for the reasoning path $\mathbf{y}$ given the input $\mathbf{x}$.
  • $n_s$ is the total number of steps in the path.
  • $\log \Pr(\text{correct} \mid \mathbf{x}, \bar{\mathbf{y}}_{\le k})$ is the log-probability of the 'correct' label for step $k$, as generated by the reward model, given the input and the reasoning path up to that step.

The reward score $r(\mathbf{x}, \mathbf{y})$ can then be used to train the policy in RLHF.
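As a minimal sketch of the formula above: given the per-step probabilities $\Pr(\text{correct} \mid \mathbf{x}, \bar{\mathbf{y}}_{\le k})$ already produced by a step-level reward model (the function name and input format here are illustrative, not from the book), the path reward is just the sum of their logarithms.

```python
import math

def path_reward(step_correct_probs):
    """Sum the log-probabilities of the 'correct' label over all steps.

    step_correct_probs: list of Pr(correct | x, y_<=k) for k = 1..n_s,
    assumed to come from a step-level reward model (hypothetical input).
    """
    return sum(math.log(p) for p in step_correct_probs)

# Example: a 3-step path the reward model rates 0.9, 0.8, 0.95 'correct'.
reward = path_reward([0.9, 0.8, 0.95])  # ~ -0.38
```

Because each term is a log-probability, every step contributes a value at most 0, so less confident steps lower the total reward smoothly rather than in unit decrements as step counting would.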

Updated 2026-05-03
