Learn Before
Formula for Log-Probability-Based Reward for Reasoning Paths
The reward for a reasoning path is calculated by summing the log-probabilities of each step being classified as 'correct' by the reward model. This method provides a more granular score than simply counting correct steps. The formula is:

$$R(y \mid x) = \sum_{t=1}^{T} \log \Pr(\text{correct} \mid x, y_{1:t})$$

where:
- $R(y \mid x)$ is the total reward for the reasoning path $y$ given the input $x$.
- $T$ is the total number of steps in the path.
- $\log \Pr(\text{correct} \mid x, y_{1:t})$ is the log-probability of the 'correct' label for step $t$, as generated by the reward model, given the input $x$ and the reasoning path up to that step, $y_{1:t}$.
The reward score can then be used to train the policy in RLHF.
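The summation above can be sketched in a few lines. This is a minimal illustration, not an actual reward-model interface; the function name and the example probabilities are hypothetical:

```python
import math

def path_reward(step_correct_probs):
    """Total reward = sum of log-probabilities that each step is 'correct'."""
    return sum(math.log(p) for p in step_correct_probs)

# Hypothetical per-step 'correct' probabilities from a reward model.
probs = [0.9, 0.8, 0.95]
reward = path_reward(probs)  # ln(0.9) + ln(0.8) + ln(0.95), a negative number
```

Because each log-probability is at most 0, the total reward is non-positive, and a single low-confidence step pulls the whole path's score down sharply.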

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Formula for Log-Probability-Based Reward for Reasoning Paths
Consider two methods for scoring a multi-step reasoning process generated by an AI. Both methods use an underlying model that, for each step, outputs a probability that the step is 'correct'.
- Method A: Assigns a score of +1 to each step where the probability of being 'correct' is greater than 0.5. The total score is the sum of these step scores.
- Method B: Calculates the total score by summing the logarithm of the 'correct' probability for every step in the process.
Now, analyze two reasoning paths for the same problem:
- Path 1: Consists of 3 steps, each with a 'correct' probability of 0.9.
- Path 2: Consists of 3 steps, each with a 'correct' probability of 0.6.
Which statement accurately compares how these two methods would score the paths?
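A quick sketch, using the probabilities given above, makes the contrast concrete (the function names are illustrative):

```python
import math

def method_a(probs):
    # +1 for each step whose 'correct' probability exceeds 0.5
    return sum(1 for p in probs if p > 0.5)

def method_b(probs):
    # sum of the log 'correct' probabilities across all steps
    return sum(math.log(p) for p in probs)

path1 = [0.9, 0.9, 0.9]
path2 = [0.6, 0.6, 0.6]

print(method_a(path1), method_a(path2))   # 3 3 -> Method A ties the paths
print(method_b(path1) > method_b(path2))  # True -> Method B prefers Path 1
```

Method A scores both paths identically (every step clears 0.5), while Method B distinguishes them: 3·ln(0.9) ≈ −0.32 versus 3·ln(0.6) ≈ −1.53.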
Evaluating AI Reasoning Strategies
Nuanced Evaluation of Reasoning Paths
Learn After
Application of Log-Probability-Based Reward in RLHF Policy Training
Evaluating Reasoning Path Quality
A language model generates a three-step reasoning path to solve a problem. A reward model evaluates each step and provides the following log-probabilities of each step being 'correct':
- Step 1: -0.2
- Step 2: -0.5
- Step 3: -0.1
According to the method that calculates the total reward by summing the log-probabilities of each step, what is the final reward score for this entire path?
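The arithmetic for the question above can be checked directly:

```python
# Per-step log-probabilities from the reward model, as given above.
log_probs = [-0.2, -0.5, -0.1]
reward = sum(log_probs)
print(round(reward, 10))  # -0.8
```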
Two language models generate different reasoning paths (Path A and Path B) to solve the same complex problem. A reward model assesses each step, providing the log-probability of that step being 'correct'. The total reward for a path is the sum of these log-probabilities.
- Path A log-probabilities: [-0.4, -0.5, -0.3]
- Path B log-probabilities: [-0.1, -2.5, -0.2]
Based on this reward calculation method, which statement accurately compares the two paths?
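Summing the values given above shows why one weak step matters so much under this scheme:

```python
path_a = [-0.4, -0.5, -0.3]
path_b = [-0.1, -2.5, -0.2]
total_a, total_b = sum(path_a), sum(path_b)
# Path A totals -1.2; Path B totals -2.8. Path A earns the higher reward,
# even though Path B contains the single best-scoring step (-0.1):
# its one very weak step (-2.5) drags the whole path down.
print(total_a > total_b)  # True
```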