Formula for Log-Probability-Based Reward for Reasoning Paths

The reward for a reasoning path is calculated by summing the log-probabilities of each step being classified as 'correct' by the reward model. This method provides a more granular score than simply counting correct steps. The formula is:

$$r(\mathbf{x}, \mathbf{y}) = \sum_{k=1}^{n_s} \log \Pr(\text{correct} \mid \mathbf{x}, \bar{\mathbf{y}}_{\le k})$$

where:

  • $r(\mathbf{x}, \mathbf{y})$ is the total reward for the reasoning path $\mathbf{y}$ given the input $\mathbf{x}$.
  • $n_s$ is the total number of steps in the path.
  • $\log \Pr(\text{correct} \mid \mathbf{x}, \bar{\mathbf{y}}_{\le k})$ is the log-probability of the 'correct' label for step $k$, as generated by the reward model, given the input and the reasoning path up to that step.

The reward score $r(\mathbf{x}, \mathbf{y})$ can then be used to train the policy in RLHF.
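As a minimal sketch of the formula above: given the per-step probabilities $\Pr(\text{correct} \mid \mathbf{x}, \bar{\mathbf{y}}_{\le k})$ already produced by a step-level reward model (the function name and input format here are illustrative, not from the book), the path reward is just the sum of their logarithms.

```python
import math

def path_reward(step_correct_probs):
    """Sum the log-probabilities of the 'correct' label over all steps.

    step_correct_probs: list of Pr(correct | x, y_<=k) for k = 1..n_s,
    assumed to come from a step-level reward model (hypothetical input).
    """
    return sum(math.log(p) for p in step_correct_probs)

# Example: a 3-step path the reward model rates 0.9, 0.8, 0.95 'correct'.
reward = path_reward([0.9, 0.8, 0.95])  # ~ -0.38
```

Because each term is a log-probability, every step contributes a value at most 0, so less confident steps lower the total reward smoothly rather than in unit decrements as step counting would.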

Updated 2026-05-03
